🤖 AI Summary
Although large language models excel on standard benchmarks, they frequently fail on commonsense reasoning problems that humans find trivial, such as riddles. To address this, the work introduces BrainBench, a benchmark of 100 questions spanning 20 distinct commonsense-reasoning failure modes, which systematically reveals models' reliance on superficial heuristics rather than genuine reasoning. The study employs zero-shot evaluation, multiple independent runs, bilingual (Chinese–English) validation, and consistency analysis to provide a fine-grained diagnostic framework. Experimental results show that even the best-performing model (Claude Opus 4.6) achieves only 80.3% accuracy, while the worst scores 39.7%. All models exhibit a 6–16 percentage-point gap between accuracy and consistency, and performance consistently degrades by 2–8 percentage points in Chinese relative to English.
📝 Abstract
Large language models (LLMs) achieve impressive scores on standard benchmarks yet routinely fail questions that any human would answer correctly in seconds. We introduce BrainBench, a benchmark of 100 brainteaser questions spanning 20 carefully designed categories, each targeting a specific commonsense reasoning failure mode in LLMs. Categories range from implicit physical constraints ("Should I walk or drive my rental car to the return lot?") to semantic scope tricks and default assumption hijacks. We evaluate eight frontier models -- four from the Claude family and four from the GPT family -- using a zero-shot protocol with 10 independent runs per question. The best model, Claude Opus 4.6 with extended thinking, achieves only 80.3% accuracy; the worst, GPT-4o, scores 39.7%. Even top-performing models exhibit a 6-16 percentage-point gap between accuracy and consistency, revealing stochastic reasoning. Cross-lingual evaluation in Chinese shows most models degrade by 2-8 percentage points, confirming that these failures reflect reasoning deficits rather than language-specific artifacts. BrainBench provides a fine-grained diagnostic tool for identifying where and why LLMs substitute surface heuristics for genuine commonsense reasoning.
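The accuracy–consistency gap from the multi-run protocol can be made concrete with a short sketch. Note this is an illustrative reconstruction, not the paper's released code: it assumes consistency is defined as the fraction of questions on which all independent runs return the same answer, and all function and variable names here are hypothetical.

```python
from typing import Dict, List, Tuple

def accuracy_and_consistency(
    runs: Dict[str, List[str]],   # question id -> answers across independent runs
    gold: Dict[str, str],         # question id -> reference answer
) -> Tuple[float, float]:
    """Compute run-level accuracy and question-level consistency.

    Accuracy: fraction of individual runs whose answer matches the gold label.
    Consistency (assumed definition): fraction of questions where every run
    produced the same answer, correct or not. Consistency can exceed or trail
    accuracy; a large gap indicates stochastic rather than stable reasoning.
    """
    correct = 0
    total = 0
    consistent = 0
    for q, answers in runs.items():
        correct += sum(a == gold[q] for a in answers)
        total += len(answers)
        if len(set(answers)) == 1:  # all runs agree on this question
            consistent += 1
    return correct / total, consistent / len(runs)
```

For example, a model answering one question identically across all runs but splitting its answers on another would score high accuracy yet lower consistency, which is the kind of gap the 10-run protocol is designed to expose.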