Claude dominates the BullshitBench v2 leaderboard. This benchmark tests whether models detect and reject nonsense requests or just accept them blindly.

The top 7 spots are all Claude:

Claude Sonnet 4.6 (High): 91% detected
Claude Opus 4.5 (High): 90%
Claude Sonnet 4.6: 89%
Claude …
