🤖 AI Summary
This work addresses the tendency of large reasoning models to overthink simple problems, and the difficulty of simultaneously achieving accuracy, efficiency, and robustness across heterogeneous reasoning behaviors. The authors propose a two-stage training framework: first, Hybrid Fine-Tuning exposes the model to both thinking and no-thinking behaviors, providing a well-conditioned initialization; second, adaptive reinforcement learning combines Correctness-Preserving Advantage Shaping (CPAS), which avoids suppressing correct long-chain reasoning, with Length-Aware Gradient Regulation (LAGR), which stabilizes optimization under severe reasoning-length heterogeneity. Evaluated on Qwen2.5-1.5B and Qwen2.5-7B, the approach improves accuracy by up to 3.7 and 3.6 percentage points, respectively, reduces generation length by 40.6% and 43.9%, and shows strong generalization and stability on out-of-distribution and multi-difficulty tasks.
📝 Abstract
Large reasoning models (LRMs) achieve strong performance through extended reasoning traces, but they often exhibit overthinking behavior for low-complexity queries. Existing efforts to mitigate this issue are fundamentally limited by unstable accuracy-efficiency trade-offs and poor robustness to heterogeneous reasoning behaviors. To address these challenges, we propose a two-stage framework for stable adaptive thinking in LRMs. The framework first applies Hybrid Fine-Tuning to expose the model to both thinking and no-thinking behaviors, establishing well-conditioned initialization. It then performs adaptive reinforcement learning with Correctness-Preserving Advantage Shaping (CPAS) to avoid suppressing correct long-chain reasoning, and Length-Aware Gradient Regulation (LAGR) to stabilize optimization under severe reasoning-length heterogeneity. Extensive experiments on Qwen2.5-1.5B and 7B show consistent improvements over strong baselines, achieving up to +3.7/+3.6 accuracy points while reducing generated tokens by 40.6%/43.9%. Further analyses across varying problem difficulties and out-of-distribution tasks confirm the robustness and generalization of our approach.
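The abstract does not give CPAS's exact formulation, but the general idea of advantage shaping that never penalizes a correct response can be sketched as follows. This is a hypothetical illustration in a GRPO-style group-normalized setup: the function name, the length-penalty term, and the clip-to-zero rule are assumptions for exposition, not the paper's actual method.

```python
import numpy as np

def shaped_advantages(rewards, correct, lengths, len_penalty=0.1):
    """Hypothetical sketch of correctness-preserving advantage shaping.

    rewards: base correctness rewards per sampled response (1.0 / 0.0)
    correct: boolean mask marking which responses are correct
    lengths: token lengths, used here for an illustrative length penalty
    """
    # Length-penalized reward: shorter answers score slightly higher,
    # which is what creates pressure against overthinking.
    norm_len = (lengths - lengths.min()) / max(lengths.ptp(), 1e-8)
    r = rewards - len_penalty * norm_len
    # Group-relative advantage (GRPO-style normalization over the group).
    adv = (r - r.mean()) / (r.std() + 1e-8)
    # Correctness-preserving step: never assign a negative advantage to a
    # correct response, so long-but-correct chains are not suppressed.
    return np.where(correct & (adv < 0), 0.0, adv)
```

With a mild length penalty, correct responses keep non-negative advantages while incorrect ones stay negative; with an aggressive penalty, a long correct response that would otherwise be pushed below zero is clipped to zero instead of being actively discouraged.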