🤖 AI Summary
Although large language models excel on standard benchmarks, they frequently fail on commonsense reasoning problems that humans find trivial, such as riddles. To address this, the work introduces BrainBench, a benchmark of 100 questions spanning 20 distinct commonsense-reasoning failure modes, which systematically reveals models' reliance on superficial heuristics rather than genuine reasoning. The study employs zero-shot evaluation, multiple independent runs, bilingual (Chinese–English) validation, and consistency analysis to provide a fine-grained diagnostic framework. Experimental results show that even the best-performing model (Claude Opus 4.6) achieves only 80.3% accuracy, while the worst scores 39.7%. All models exhibit a 6–16 percentage-point gap between accuracy and consistency, and performance consistently degrades by 2–8 percentage points in Chinese relative to English.
📝 Abstract
Large language models (LLMs) achieve impressive scores on standard benchmarks yet routinely fail questions that any human would answer correctly in seconds. We introduce BrainBench, a benchmark of 100 brainteaser questions spanning 20 carefully designed categories, each targeting a specific commonsense reasoning failure mode in LLMs. Categories range from implicit physical constraints ("Should I walk or drive my rental car to the return lot?") to semantic scope tricks and default assumption hijacks. We evaluate eight frontier models -- four from the Claude family and four from the GPT family -- using a zero-shot protocol with 10 independent runs per question. The best model, Claude Opus 4.6 with extended thinking, achieves only 80.3% accuracy; the worst, GPT-4o, scores 39.7%. Even top-performing models exhibit a 6-16 percentage-point gap between accuracy and consistency, revealing stochastic reasoning. Cross-lingual evaluation in Chinese shows most models degrade by 2-8 percentage points, confirming that these failures reflect reasoning deficits rather than language-specific artifacts. BrainBench provides a fine-grained diagnostic tool for identifying where and why LLMs substitute surface heuristics for genuine commonsense reasoning.
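The accuracy–consistency gap from the multi-run protocol can be made concrete with a short sketch. Note this is an illustrative reconstruction, not the paper's released code: it assumes consistency is defined as the fraction of questions on which all independent runs return the same answer, and all function and variable names here are hypothetical.

```python
from typing import Dict, List, Tuple

def accuracy_and_consistency(
    runs: Dict[str, List[str]],   # question id -> answers across independent runs
    gold: Dict[str, str],         # question id -> reference answer
) -> Tuple[float, float]:
    """Compute run-level accuracy and question-level consistency.

    Accuracy: fraction of individual runs whose answer matches the gold label.
    Consistency (assumed definition): fraction of questions where every run
    produced the same answer, correct or not. Consistency can exceed or trail
    accuracy; a large gap indicates stochastic rather than stable reasoning.
    """
    correct = 0
    total = 0
    consistent = 0
    for q, answers in runs.items():
        correct += sum(a == gold[q] for a in answers)
        total += len(answers)
        if len(set(answers)) == 1:  # all runs agree on this question
            consistent += 1
    return correct / total, consistent / len(runs)
```

For example, a model answering one question identically across all runs but splitting its answers on another would score high accuracy yet lower consistency, which is the kind of gap the 10-run protocol is designed to expose.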