Reasoning Models Reason Well, Until They Don't

📅 2025-10-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing benchmarks lack sufficient complexity to rigorously evaluate the out-of-distribution generalization capability of large reasoning models (LRMs), obscuring their fundamental limitations when confronted with real-world knowledge graph structures. Method: To address this, the authors introduce DeepRD—a scalable, infinitely extensible dataset—and propose a fine-grained, controllable generation paradigm grounded in graph connectivity and natural-language proof planning. This enables systematic, multi-dimensional evaluation of LRM reasoning behavior beyond training distribution complexity. Contribution/Results: Experiments reveal that while LRMs excel at moderate complexity, their performance collapses catastrophically at high complexity, exposing an intrinsic gap between current LRM capabilities and the long-tailed structural complexity inherent in real-world knowledge graphs. This work establishes a novel benchmark and evaluation paradigm for rigorous assessment of reasoning capacity.

📝 Abstract
Large language models (LLMs) have shown significant progress in reasoning tasks. However, recent studies show that transformers and LLMs fail catastrophically once reasoning problems exceed modest complexity. We revisit these findings through the lens of large reasoning models (LRMs) -- LLMs fine-tuned with incentives for step-by-step argumentation and self-verification. LRM performance on graph and reasoning benchmarks such as NLGraph seems extraordinary, with some even claiming that LRMs are capable of generalized reasoning and innovation in reasoning-intensive fields such as mathematics, physics, medicine, and law. However, by more carefully scaling the complexity of reasoning problems, we show that existing benchmarks actually have limited complexity. We develop a new dataset, the Deep Reasoning Dataset (DeepRD), along with a generative process for producing unlimited examples of scalable complexity. We use this dataset to evaluate model performance on graph connectivity and natural language proof planning. We find that the performance of LRMs drops abruptly at sufficient complexity and does not generalize. We also relate our LRM results to the distributions of the complexities of large, real-world knowledge graphs, interaction graphs, and proof datasets. We find that the majority of real-world examples fall inside the LRMs' success regime, yet the long tails expose substantial failure potential. Our analysis highlights the near-term utility of LRMs while underscoring the need for new methods that generalize beyond the complexity of examples in the training distribution.
Problem

Research questions and friction points this paper is trying to address.

Evaluating reasoning model performance on scalable complexity benchmarks
Analyzing catastrophic failure of models when complexity exceeds thresholds
Identifying limitations in generalization despite near-term practical utility
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuned LLMs with step-by-step argumentation incentives
Developed DeepRD dataset for scalable complexity evaluation
Analyzed LRM performance boundaries using real-world graph distributions
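The core idea behind a scalable-complexity benchmark like DeepRD is that difficulty can be dialed up continuously, for example by lengthening the unique path a model must trace through a graph. The paper's actual generator is not reproduced here; the following is a minimal hypothetical sketch in which "complexity" is simply the length of the single source-to-target path, padded with disconnected distractor edges:

```python
import random

def make_connectivity_example(path_length, num_distractors, seed=0):
    """Hypothetical sketch of a scalable graph-connectivity generator.

    Complexity is controlled by `path_length`: the only route from the
    source node to the target node is a chain of that many edges.
    Distractor edges live on fresh nodes, so they never create shortcuts.
    """
    rng = random.Random(seed)

    # Chain 0 -> 1 -> ... -> path_length forms the unique true path.
    path = list(range(path_length + 1))
    edges = [(path[i], path[i + 1]) for i in range(path_length)]

    # Distractor edges among nodes disconnected from the path.
    next_node = path_length + 1
    for _ in range(num_distractors):
        edges.append((next_node, next_node + 1))
        next_node += 2

    # Shuffle so the model cannot read the path off in order.
    rng.shuffle(edges)

    question = (
        "Edges: " + ", ".join(f"{a}->{b}" for a, b in edges)
        + f". Is node {path[0]} connected to node {path[-1]}?"
    )
    return question, edges

q, edges = make_connectivity_example(path_length=5, num_distractors=3)
print(q)
```

Because `path_length` can be made arbitrarily large, examples of any desired complexity can be sampled on demand, which is what lets an evaluation sweep past the regime covered by existing benchmarks.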
Revanth Rameshkumar
University of Washington
Jimson Huang
Purdue University
Yunxin Sun
Purdue University
Fei Xia
University of Washington
Abulhair Saparov
Assistant Professor, Purdue University
Natural Language Understanding · Reasoning · Natural Language Processing · Statistical Machine Learning