Mitigating Overthinking in Large Reasoning Models via Difficulty-aware Reinforcement Learning

📅 2026-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the issue of excessive reasoning in large reasoning models on simple tasks, which often stems from a lack of difficulty awareness and leads to redundant computation and resource inefficiency. To mitigate this, the authors propose the Difficulty-aware Policy Optimization (DiPO) framework, which leverages reinforcement learning to enable models to autonomously estimate task complexity and dynamically adjust both reasoning depth and generation strategy. DiPO introduces a novel self-reasoning-based difficulty modeling mechanism that minimizes reliance on human-annotated difficulty labels and incorporates a reward function explicitly integrating difficulty signals to balance reasoning length against performance. Experimental results demonstrate that DiPO significantly reduces redundant token generation without compromising task accuracy, thereby achieving adaptive control over inference costs.

📝 Abstract
Large Reasoning Models (LRMs) achieve explicit chain-of-thought expansion by imitating the deep-thinking behavior of humans, demonstrating excellent performance on complex tasks. However, this deep-thinking mode often leads to unnecessarily lengthy reasoning and resource inefficiency on simple tasks. This overthinking phenomenon may arise from generation preferences induced by the reward function during post-training. Existing research attempts to mitigate overthinking through prompt design or model training, but generally underestimates the importance of task-difficulty awareness, which makes it difficult for LRMs to allocate reasoning resources effectively. In this paper, we propose Difficulty-aware Policy Optimization (DiPO), a reinforcement learning-based LRM training framework. DiPO encourages the LRM to spontaneously model task complexity and integrates these difficulty estimates into the reinforcement learning framework to adjust the generation preferences introduced by post-training. A difficulty modeling method based on model self-reasoning is proposed, which significantly reduces dependence on manual annotation and formalizes task complexity. We further develop a difficulty-signal-enhanced reward function that incorporates a penalty for lengthy reasoning while also accounting for reasoning performance and output format. Experimental results indicate that DiPO enables the model to spontaneously adjust inference overhead, significantly reducing redundant tokens without losing performance to thought compression.
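The abstract describes a reward that combines reasoning performance, output format, and a difficulty-scaled length penalty. The exact DiPO reward is defined in the paper; the sketch below is only a minimal illustration of that general shape, assuming a normalized difficulty score in [0, 1] and assuming harder tasks are allotted a larger token budget before the length penalty applies. All constants and the function name are hypothetical.

```python
def difficulty_aware_reward(
    is_correct: bool,
    has_valid_format: bool,
    num_tokens: int,
    difficulty: float,       # assumed self-estimated difficulty in [0, 1]
    max_tokens: int = 4096,  # hypothetical generation budget
    length_weight: float = 0.5,
) -> float:
    """Illustrative reward: accuracy + format bonus - difficulty-scaled length penalty."""
    reward = 1.0 if is_correct else 0.0
    reward += 0.1 if has_valid_format else -0.1
    # Tokens used as a fraction of the budget; only the portion exceeding the
    # difficulty-proportional allowance is penalized, so easy tasks (low
    # difficulty) are pushed toward short outputs while hard tasks may reason longer.
    excess = max(0.0, num_tokens / max_tokens - difficulty)
    reward -= length_weight * excess
    return reward
```

Under this toy scoring, a correct but verbose answer to an easy task scores lower than a correct concise one, while the same verbosity on a hard task goes unpenalized, which is the adaptive-length behavior the paper targets.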
Problem

Research questions and friction points this paper is trying to address.

overthinking
Large Reasoning Models
task difficulty awareness
reasoning efficiency
resource allocation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Difficulty-aware Reinforcement Learning
Large Reasoning Models
Overthinking Mitigation
Task Complexity Modeling
Policy Optimization
Qian Wan
Central China Normal University
Natural Language Processing · Information Extraction · Large language model
Ziao Xu
National Engineering Research Center of Educational Big Data and the Faculty of Artificial Intelligence in Education, Central China Normal University, Wuhan 430079, China
Luona Wei
College of Electronics and Information Engineering, South-Central Minzu University, Wuhan 430074, China
Xiaoxuan Shen
Laboratory for Artificial Intelligence and New Forms of Education, the National Engineering Research Center of Educational Big Data, and the Faculty of Artificial Intelligence in Education, Central China Normal University, Wuhan 430079, China
Jianwen Sun
Software Engineering Application Technology Lab, Huawei, China
Software engineering · Deep reinforcement learning