Evaluating the Logical Reasoning Abilities of Large Reasoning Models

📅 2025-05-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Existing benchmarks do not adequately assess foundational logical reasoning (deductive, inductive, analogical, and abductive) independently of domain-specific knowledge. Method: We introduce LogiEval, a holistic benchmark for logical reasoning, curated from high-quality human exams (e.g., LSAT, GMAT) and spanning multiple task formats such as logical sequence and argument analysis. We further curate LogiEval-Hard, a challenging subset built with a screening paradigm in which failures of a small model (Qwen3-30B-A3B) predict difficulty for larger models. Contribution/Results: Experiments show that modern reasoning models surpass human performance on analogical reasoning and 4-choice argument analysis, yet perform unevenly across reasoning types and formats and fail consistently on LogiEval-Hard, indicating that fundamental reasoning bottlenecks persist across model scales.

📝 Abstract

Large reasoning models, often post-trained on long chain-of-thought (long CoT) data with reinforcement learning, achieve state-of-the-art performance on mathematical, coding, and domain-specific reasoning benchmarks. However, their logical reasoning capabilities - fundamental to human cognition and independent of domain knowledge - remain understudied. To address this gap, we introduce LogiEval, a holistic benchmark for evaluating logical reasoning in large reasoning models. LogiEval spans diverse reasoning types (deductive, inductive, analogical, and abductive) and task formats (e.g., logical sequence, argument analysis), sourced from high-quality human examinations (e.g., LSAT, GMAT). Our experiments demonstrate that modern reasoning models excel at 4-choice argument analysis problems and analogical reasoning, surpassing human performance, yet exhibit uneven capabilities across reasoning types and formats, highlighting limitations in their generalization. Our analysis reveals that human performance does not mirror model failure distributions. To foster further research, we curate LogiEval-Hard, a challenging subset identified through a novel screening paradigm where small-model failures (Qwen3-30B-A3B) reliably predict difficulties for larger models. Modern models show striking, consistent failures on LogiEval-Hard. This demonstrates that fundamental reasoning bottlenecks persist across model scales, and establishes LogiEval-Hard as both a diagnostic tool and a rigorous testbed for advancing logical reasoning in LLMs.
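
The screening idea behind LogiEval-Hard can be illustrated with a minimal sketch. It assumes the simplest possible reading of the abstract: an item is kept as "hard" when the small model answers it incorrectly. The paper's actual selection criterion may differ (for example, it could rely on repeated sampling). The `Item` class, the function names, and the toy answers below are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """Hypothetical multiple-choice item; field names are assumptions."""
    item_id: str
    reasoning_type: str  # e.g. "deductive"
    gold: str            # gold answer label, e.g. "B"


def screen_hard_subset(items, small_model_answers):
    """Keep items the small model answered incorrectly (or did not answer)."""
    return [
        item for item in items
        if small_model_answers.get(item.item_id) != item.gold
    ]


def accuracy(items, answers):
    """Fraction of items answered correctly; 0.0 for an empty item list."""
    if not items:
        return 0.0
    correct = sum(1 for item in items if answers.get(item.item_id) == item.gold)
    return correct / len(items)


# Toy data, invented purely for illustration.
items = [
    Item("q1", "deductive", "A"),
    Item("q2", "inductive", "C"),
    Item("q3", "abductive", "B"),
    Item("q4", "analogical", "D"),
]
small_answers = {"q1": "A", "q2": "B", "q3": "D", "q4": "D"}
large_answers = {"q1": "A", "q2": "C", "q3": "A", "q4": "D"}

hard = screen_hard_subset(items, small_answers)
hard_ids = [item.item_id for item in hard]
full_acc = accuracy(items, large_answers)
hard_acc = accuracy(hard, large_answers)
print(hard_ids, full_acc, hard_acc)
```

Comparing `full_acc` with `hard_acc` for a larger model is one simple way to check whether small-model failures carry over: a marked drop on the screened subset would be consistent with the claim in the abstract.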

Problem

Research questions and friction points this paper is trying to address.

Evaluating logical reasoning abilities in large reasoning models

Assessing model performance across diverse reasoning types and formats

Identifying persistent reasoning bottlenecks across different model scales
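
Assessing performance across reasoning types and formats amounts to breaking accuracy down by a grouping field. The sketch below shows one way to do that. The record layout (`reasoning_type`, `task_format`, `gold`, `predicted`) and the sample records are assumptions made for illustration, not the benchmark's actual schema or results.

```python
from collections import defaultdict


def accuracy_by(records, key):
    """Return {group: accuracy} for records grouped by the given field."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        correct[group] += int(record["predicted"] == record["gold"])
    return {group: correct[group] / totals[group] for group in totals}


# Toy records, invented purely for illustration.
records = [
    {"reasoning_type": "deductive", "task_format": "argument analysis",
     "gold": "A", "predicted": "A"},
    {"reasoning_type": "deductive", "task_format": "logical sequence",
     "gold": "C", "predicted": "B"},
    {"reasoning_type": "analogical", "task_format": "argument analysis",
     "gold": "D", "predicted": "D"},
    {"reasoning_type": "abductive", "task_format": "logical sequence",
     "gold": "B", "predicted": "B"},
]

by_type = accuracy_by(records, "reasoning_type")
by_format = accuracy_by(records, "task_format")
print(by_type)
print(by_format)
```

Reporting both breakdowns side by side makes uneven capabilities visible that a single aggregate score would hide.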

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces LogiEval benchmark for logical reasoning

Covers four reasoning types (deductive, inductive, analogical, abductive) and multiple task formats (e.g., logical sequence, argument analysis)

Curates LogiEval-Hard, a challenging subset screened via small-model (Qwen3-30B-A3B) failures, as a diagnostic testbed
