Stable Adaptive Thinking via Advantage Shaping and Length-Aware Gradient Regulation

📅 2026-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the tendency of large reasoning models to overthink simple problems, and the challenge of simultaneously achieving accuracy, efficiency, and robustness across heterogeneous reasoning behaviors. The authors propose a two-stage training framework: first, Hybrid Fine-Tuning exposes the model to both with- and without-reasoning behaviors, providing an effective initialization; second, adaptive reinforcement learning combines Correctness-Preserving Advantage Shaping (CPAS), which prevents the suppression of correct long-chain reasoning, with Length-Aware Gradient Regulation (LAGR), which stabilizes optimization under heterogeneous reasoning lengths. Evaluated on Qwen2.5-1.5B and Qwen2.5-7B, the approach improves accuracy by up to 3.7 and 3.6 percentage points, respectively, reduces generation length by 40.6% and 43.9%, and demonstrates strong generalization and stability on out-of-distribution and multi-difficulty tasks.

📝 Abstract
Large reasoning models (LRMs) achieve strong performance through extended reasoning traces, but they often exhibit overthinking behavior for low-complexity queries. Existing efforts to mitigate this issue are fundamentally limited by unstable accuracy-efficiency trade-offs and poor robustness to heterogeneous reasoning behaviors. To address these challenges, we propose a two-stage framework for stable adaptive thinking in LRMs. The framework first applies Hybrid Fine-Tuning to expose the model to both thinking and no-thinking behaviors, establishing well-conditioned initialization. It then performs adaptive reinforcement learning with Correctness-Preserving Advantage Shaping (CPAS) to avoid suppressing correct long-chain reasoning, and Length-Aware Gradient Regulation (LAGR) to stabilize optimization under severe reasoning-length heterogeneity. Extensive experiments on Qwen2.5-1.5B and 7B show consistent improvements over strong baselines, achieving up to +3.7/+3.6 accuracy points while reducing generated tokens by 40.6%/43.9%. Further analyses across varying problem difficulties and out-of-distribution tasks confirm the robustness and generalization of our approach.
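The abstract describes the two RL components only at a high level, so the following sketch is illustrative rather than the paper's actual method: the function names, the linear length-penalty form, and the `alpha` coefficient are all assumptions. It shows the general spirit of (a) shaping group-normalized advantages so that a length penalty scales down, but never sign-flips, the positive advantage of a correct long answer, and (b) reweighting per-sample gradients inversely by token count so long traces do not dominate a batch.

```python
import numpy as np

def cpas_advantages(rewards, lengths, alpha=0.5):
    """Hypothetical correctness-preserving advantage shaping.

    Advantages are group-normalized (GRPO-style); a relative-length
    penalty is applied only to correct responses, and with alpha <= 1
    the scaling factor stays non-negative, so a correct long chain is
    discouraged but never assigned a negative advantage.
    """
    rewards = np.asarray(rewards, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    adv = rewards - rewards.mean()
    std = rewards.std()
    if std > 0:
        adv = adv / std
    # relative length in [0, 1] within the sampled group
    rel_len = (lengths - lengths.min()) / max(lengths.max() - lengths.min(), 1e-8)
    shaped = adv.copy()
    correct = rewards > 0
    # shrink (never flip) positive advantages of long correct answers
    shaped[correct] = adv[correct] * (1.0 - alpha * rel_len[correct])
    return shaped

def lagr_weights(lengths):
    """Hypothetical length-aware gradient regulation.

    Weights each sample inversely by its token count, normalized so the
    weights sum to the batch size, to keep long reasoning traces from
    dominating the batch gradient.
    """
    lengths = np.asarray(lengths, dtype=float)
    w = lengths.mean() / lengths
    return w / w.sum() * len(lengths)
```

For example, with two correct answers of 100 and 400 tokens, the shorter one keeps a larger shaped advantage, while the 400-token answer's advantage remains positive rather than being suppressed outright.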
Problem

Research questions and friction points this paper is trying to address.

overthinking
reasoning efficiency
accuracy-efficiency trade-off
reasoning heterogeneity
large reasoning models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Reasoning
Advantage Shaping
Gradient Regulation
Large Reasoning Models
Efficiency-Accuracy Trade-off
Zihang Xu
The University of Hong Kong
artificial intelligence, medical information analysis
Haozhi Xie
Beihang University, Shanghai AI Laboratory, Beijing University of Posts and Telecommunications, Renmin University of China
Ziqi Miao
Beihang University, Shanghai AI Laboratory, Beijing University of Posts and Telecommunications, Renmin University of China
Wuxuan Gong
Beihang University, Shanghai AI Laboratory, Beijing University of Posts and Telecommunications, Renmin University of China
Chen Qian
Renmin University of China
Large Language Models, Safety, Interpretability, Graph Neural Networks
Lijun Li
Shanghai AI Lab
Computer vision, LLM safety