🤖 AI Summary
This work addresses the issue of excessive reasoning in large reasoning models on simple tasks, which often stems from a lack of difficulty awareness and leads to redundant computation and resource inefficiency. To mitigate this, the authors propose the Difficulty-aware Policy Optimization (DiPO) framework, which leverages reinforcement learning to enable models to autonomously estimate task complexity and dynamically adjust both reasoning depth and generation strategy. DiPO introduces a novel self-reasoning-based difficulty modeling mechanism that minimizes reliance on human-annotated difficulty labels and incorporates a reward function explicitly integrating difficulty signals to balance reasoning length against performance. Experimental results demonstrate that DiPO significantly reduces redundant token generation without compromising task accuracy, thereby achieving adaptive control over inference costs.
📝 Abstract
Large Reasoning Models (LRMs) achieve explicit chain-of-thought expansion by imitating human deep-thinking behaviors, demonstrating excellent performance on complex tasks. However, this deep-thinking mode often produces unnecessarily lengthy reasoning and wastes resources on simple tasks. This overthinking phenomenon may arise from generation preferences introduced by the reward function during post-training. Existing research attempts to mitigate overthinking through prompt design or model training, but generally underestimates the importance of task-difficulty awareness, which leaves LRMs unable to allocate reasoning resources effectively. In this paper, we propose Difficulty-aware Policy Optimization (DiPO), a reinforcement learning-based LRM training framework. DiPO encourages the LRM to spontaneously model task complexity and integrates these difficulty estimates into the reinforcement learning framework to adjust the generation preferences introduced by post-training. We propose a difficulty modeling method based on model self-reasoning, which formalizes task complexity and significantly reduces the dependence on manual annotation. We further develop a difficulty-signal-enhanced reward function that penalizes lengthy reasoning while accounting for reasoning performance and output format. Experimental results indicate that DiPO enables the model to spontaneously adjust inference overhead, significantly reducing redundant tokens without sacrificing performance to thought compression.
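The abstract does not give the exact reward formula, but the described combination (reasoning performance, output format, and a difficulty-scaled penalty on length) can be sketched as follows. This is a hypothetical illustration, not DiPO's actual implementation: the function name, weights (`alpha`, `beta`, `gamma`), and the linear penalty schedule are all assumptions.

```python
def difficulty_aware_reward(correct: bool, well_formatted: bool,
                            num_tokens: int, difficulty: float,
                            max_tokens: int = 4096,
                            alpha: float = 1.0, beta: float = 0.1,
                            gamma: float = 0.5) -> float:
    """Hypothetical difficulty-signal-enhanced reward (not the paper's formula).

    difficulty is a self-estimated score in [0, 1]: 0 = trivial, 1 = hardest.
    The length penalty shrinks as estimated difficulty grows, so long
    reasoning is discouraged on easy tasks but tolerated on hard ones.
    """
    accuracy_term = alpha if correct else 0.0          # task performance
    format_term = beta if well_formatted else 0.0      # output format check
    length_ratio = min(num_tokens / max_tokens, 1.0)   # normalized length
    length_penalty = gamma * (1.0 - difficulty) * length_ratio
    return accuracy_term + format_term - length_penalty
```

Under this sketch, an identical long correct answer earns a lower reward when the task is judged easy than when it is judged hard, which is the generation-preference adjustment the abstract describes.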