Correct, Concise and Complete: Multi-stage Training For Adaptive Reasoning

📅 2026-01-06
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses "overthinking" in large language models, where excessively long chains of thought (CoT) increase computational cost and can even degrade performance. The authors propose a multi-stage efficient reasoning framework that combines supervised fine-tuning with reinforcement learning, featuring an adaptive length penalty mechanism. This mechanism selectively encourages self-verification only when beneficial and suppresses redundant output once a correct answer is first generated. Fine-tuning is performed via rejection sampling or reformatting of reasoning trajectories. Evaluated on 8B and 32B models, the method reduces response length by 28% and 40%, respectively, with only minor accuracy drops of 1.6 and 2.5 percentage points, achieving an AUC_OAA of 76.6, substantially outperforming existing approaches.

📝 Abstract
The reasoning capabilities of large language models (LLMs) have improved substantially through increased test-time computation, typically in the form of intermediate tokens known as chain-of-thought (CoT). However, CoT often becomes unnecessarily long, increasing computation cost without actual accuracy gains or sometimes even degrading performance, a phenomenon known as "overthinking". We propose a multi-stage efficient reasoning method that combines supervised fine-tuning (via rejection sampling or reasoning trace reformatting) with reinforcement learning using an adaptive length penalty. We introduce a lightweight reward function that penalizes tokens generated after the first correct answer while encouraging self-verification only when beneficial. We conduct a holistic evaluation across seven diverse reasoning tasks, analyzing the accuracy-response length trade-off. Our approach reduces response length by an average of 28% for 8B models and 40% for 32B models, while incurring only minor performance drops of 1.6 and 2.5 points, respectively. Despite its conceptual simplicity, it achieves a superior trade-off compared to more complex state-of-the-art efficient reasoning methods, scoring 76.6 in terms of the area under the Overthinking-Adjusted Accuracy curve (AUC_OAA), 5 points above the base model and 2.5 points above the second-best approach.
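The abstract's core idea, penalizing tokens emitted after the first correct answer, can be sketched as a simple reward function. This is a minimal illustration, not the paper's actual implementation: the function name, the per-token penalty value, and the `is_correct_prefix` checker are all hypothetical, and the paper's adaptive mechanism for rewarding beneficial self-verification is omitted here.

```python
def length_penalized_reward(tokens, is_correct_prefix,
                            base_reward=1.0, penalty=0.001):
    """Hedged sketch of a first-correct-answer length penalty.

    tokens: the generated response as a token list.
    is_correct_prefix: hypothetical callable that returns True when a
        prefix of the response already contains the correct answer.
    """
    # Find the earliest prefix length that already contains the answer.
    first_correct = None
    for i in range(1, len(tokens) + 1):
        if is_correct_prefix(tokens[:i]):
            first_correct = i
            break

    if first_correct is None:
        return 0.0  # incorrect responses earn no reward

    # Every token generated after the first correct answer is penalized.
    extra_tokens = len(tokens) - first_correct
    return base_reward - penalty * extra_tokens
```

For example, a response that reaches the answer at token 3 and then emits 2 redundant tokens would score `1.0 - 0.001 * 2 = 0.998`, so shorter correct responses are preferred during RL training.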
Problem

Research questions and friction points this paper is trying to address.

overthinking
chain-of-thought
reasoning efficiency
large language models
response length
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-stage training
adaptive reasoning
chain-of-thought compression
overthinking mitigation
length-penalized reinforcement learning