FoE: Forest of Errors Makes the First Solution the Best in Large Reasoning Models

📅 2026-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the performance degradation often caused by multi-solution generation in large reasoning models at test time. The study identifies a “first-solution optimality” phenomenon and reveals that errors propagate in a forest-like structure—termed Forest of Errors (FoE)—as reasoning unfolds. To mitigate this, the authors propose the RED framework, which integrates a Refining First module to enhance the initial solution and a Discarding Subs module to prune erroneous reasoning paths, supported by dual consistency checks and explicit error-structure modeling. RED challenges prevailing test-time scaling laws by establishing a self-guided, efficient reasoning paradigm. Evaluated across five benchmarks and six backbone models, RED outperforms eight strong baselines, achieving up to a 19.0% performance gain while reducing token consumption by 37.7%–70.4%.
📝 Abstract
Recent Large Reasoning Models (LRMs) like DeepSeek-R1 have demonstrated remarkable success in complex reasoning tasks, exhibiting human-like patterns in exploring multiple alternative solutions. Upon closer inspection, however, we uncover a surprising phenomenon: The First is The Best, where alternative solutions are not merely suboptimal but potentially detrimental. This observation challenges widely accepted test-time scaling laws, leading us to hypothesize that errors within the reasoning path scale concurrently with test time. Through comprehensive empirical analysis, we characterize errors as a forest-structured Forest of Errors (FoE) and conclude that FoE makes the First the Best, which is underpinned by rigorous theoretical analysis. Leveraging these insights, we propose RED, a self-guided efficient reasoning framework comprising two components: I) Refining First, which suppresses FoE growth in the first solution; and II) Discarding Subs, which prunes subsequent FoE via dual-consistency. Extensive experiments across five benchmarks and six backbone models demonstrate that RED outperforms eight competitive baselines, achieving performance gains of up to 19.0% while reducing token consumption by 37.7% ~ 70.4%. Moreover, comparative experiments on FoE metrics shed light on how RED achieves effectiveness.
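The abstract describes RED as two steps: invest effort in the first solution (Refining First), then prune subsequent alternatives via a dual-consistency check (Discarding Subs). A minimal sketch of such a loop is below; the function names (`generate`, `refine`, `answer_of`) and the agreement-on-final-answer proxy for dual-consistency are illustrative assumptions, not the paper's actual API or criterion.

```python
from typing import Callable

def red_answer(
    generate: Callable[[str], str],     # hypothetical LLM call: problem -> solution text
    refine: Callable[[str, str], str],  # hypothetical refinement call: (problem, draft) -> improved draft
    answer_of: Callable[[str], str],    # extracts the final answer from a solution string
    problem: str,
    n_alternatives: int = 3,
) -> str:
    """Sketch of a RED-style self-guided loop (names are illustrative).

    I) Refining First: spend budget improving the first solution instead of
       branching early, suppressing error growth in the initial path.
    II) Discarding Subs: keep an alternative only if it is consistent with
        the refined first solution; otherwise prune it immediately, so the
        pruned branch consumes no further tokens.
    """
    # I) Refining First: generate, then refine, the first solution.
    first = refine(problem, generate(problem))
    kept = [first]
    for _ in range(n_alternatives):
        alt = generate(problem)
        # II) Discarding Subs: a toy consistency proxy, requiring the
        # alternative to agree with the first solution's final answer.
        if answer_of(alt) == answer_of(first):
            kept.append(alt)
    # Per the "first is the best" observation, the first solution's answer
    # is returned; kept alternatives could serve as supporting evidence.
    return answer_of(first)
```

With stub callables standing in for a real model, `red_answer(gen, ref, ans, "what is 2*21?")` runs end-to-end and returns the answer extracted from the refined first solution.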
Problem

Research questions and friction points this paper addresses.

Large Reasoning Models, reasoning errors, test-time scaling, alternative solutions, error propagation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Forest of Errors, The First is The Best, self-guided reasoning, test-time scaling, error pruning
Kehan Jiang
School of Software and Microelectronics, Peking University
Haonan Dong
State Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University
Zhaolu Kang
School of Software and Microelectronics, Peking University
Zhengzhou Zhu
School of Software and Microelectronics, Peking University
Guojie Song
Professor (Research), Tenured, Peking University
Psychological AI, AI Safe & Value Alignment, Agent Cognition & Behavioral Modeling, LLM&GML