🤖 AI Summary
Problem: Existing benchmarks inadequately assess foundational logical reasoning (deductive, inductive, analogical, and abductive) independently of domain-specific knowledge. Method: We introduce LogiEval, a holistic benchmark for logical reasoning, curated from high-quality human exams (e.g., LSAT, GMAT) and spanning multiple task formats; we further propose LogiEval-Hard, a challenging subset built by screening for items that a small model (Qwen3-30B-A3B) fails, since such failures predict difficulty for larger models. We advocate a two-stage “small-model screening + large-model generalization” evaluation paradigm. Contribution/Results: Experiments show that modern reasoning models surpass human performance on analogical reasoning and four-choice argument analysis, yet perform unevenly across reasoning types and formats and fail consistently on LogiEval-Hard, indicating that fundamental reasoning bottlenecks persist across model scales.
📝 Abstract
Large reasoning models, often post-trained on long chain-of-thought (long CoT) data with reinforcement learning, achieve state-of-the-art performance on mathematical, coding, and domain-specific reasoning benchmarks. However, their logical reasoning capabilities, which are fundamental to human cognition and independent of domain knowledge, remain understudied. To address this gap, we introduce LogiEval, a holistic benchmark for evaluating logical reasoning in large reasoning models. LogiEval spans diverse reasoning types (deductive, inductive, analogical, and abductive) and task formats (e.g., logical sequence, argument analysis), sourced from high-quality human examinations (e.g., LSAT, GMAT). Our experiments demonstrate that modern reasoning models excel at 4-choice argument analysis problems and analogical reasoning, surpassing human performance, yet exhibit uneven capabilities across reasoning types and formats, highlighting limitations in their generalization. Our analysis reveals that human performance does not mirror model failure distributions. To foster further research, we curate LogiEval-Hard, a challenging subset identified through a novel screening paradigm in which small-model failures (Qwen3-30B-A3B) reliably predict difficulties for larger models. Modern models show striking, consistent failures on LogiEval-Hard. This demonstrates that fundamental reasoning bottlenecks persist across model scales and establishes LogiEval-Hard as both a diagnostic tool and a rigorous testbed for advancing logical reasoning in LLMs.
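The screening paradigm described above can be illustrated with a minimal sketch. The abstract does not specify the exact selection criterion, so this example assumes an item counts as hard when the small model answers it incorrectly on every attempt. The item schema (`question`, `answer`), the function names, and the `attempts` parameter are all hypothetical and chosen only for illustration; the model callables stand in for whatever inference interface is actually used.

```python
from typing import Callable, Dict, List

# Hypothetical item schema (not from the paper): {"id", "question", "answer"}
Item = Dict[str, str]
# A model is represented as a callable mapping a question to a predicted answer.
Model = Callable[[str], str]


def screen_hard_subset(items: List[Item], small_model: Model, attempts: int = 1) -> List[Item]:
    """Keep the items the small model gets wrong on every attempt.

    Assumption: "failure" means all attempts are wrong; the source does not
    state the criterion actually used to build LogiEval-Hard.
    """
    hard = []
    for item in items:
        failures = sum(
            small_model(item["question"]) != item["answer"] for _ in range(attempts)
        )
        if failures == attempts:
            hard.append(item)
    return hard


def accuracy(items: List[Item], model: Model) -> float:
    """Fraction of items answered correctly; 0.0 for an empty list."""
    if not items:
        return 0.0
    correct = sum(model(item["question"]) == item["answer"] for item in items)
    return correct / len(items)
```

In this sketch, stage one calls `screen_hard_subset` with the small model, and stage two calls `accuracy` with a larger model on both the full set and the screened subset. A large gap between the two scores would be consistent with the claim that small-model failures predict difficulty for larger models.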