🤖 AI Summary
Large reasoning models (LRMs) often exhibit “overthinking” on simple problems, producing redundant computation and reducing inference efficiency. Existing efficient reasoning approaches rely on predefined token budgets or task-difficulty estimators, which limits their generalizability and robustness. To address this, we propose the Verifiable Stepwise Reward Mechanism (VSRM), a rule-driven framework that evaluates the validity of each intermediate reasoning step based on its verifiability. Integrated with PPO and Reinforce++ optimization, VSRM enables automatic pruning of invalid reasoning steps without requiring fixed token budgets. Evaluated on mathematical reasoning benchmarks, including AIME24 and AIME25, VSRM substantially reduces output length and overthinking frequency while preserving or even improving pass@k accuracy, jointly improving inference efficiency and solution correctness.
📝 Abstract
Large reasoning models (LRMs) have recently achieved significant progress on complex reasoning tasks, aided by reinforcement learning with verifiable rewards. However, LRMs often suffer from overthinking, expending excessive computation on simple problems and reducing efficiency. Existing efficient reasoning methods typically require accurate task assessment to preset token budgets or select reasoning modes, which limits their flexibility and reliability. In this work, we revisit the essence of overthinking and identify that encouraging effective steps while penalizing ineffective ones is key to solving it. To this end, we propose a novel rule-based verifiable stepwise reward mechanism (VSRM), which assigns rewards based on the performance of intermediate states in the reasoning trajectory. This approach is intuitive and naturally fits the step-by-step nature of reasoning tasks. We conduct extensive experiments on standard mathematical reasoning benchmarks, including AIME24 and AIME25, by integrating VSRM with PPO and Reinforce++. Results show that our method achieves substantial output-length reduction while maintaining the original reasoning performance, striking a favorable balance between efficiency and accuracy. Further analysis of overthinking frequency and pass@k scores before and after training demonstrates that our approach indeed suppresses ineffective steps and encourages effective reasoning, fundamentally alleviating the overthinking problem. All code will be released upon acceptance.
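The abstract describes rewarding effective intermediate steps and penalizing ineffective ones, but does not spell out the exact rule. Below is a minimal illustrative sketch of one plausible stepwise reward scheme under that idea: each prefix of the trajectory is checked by a verifier, the step that first makes the state verifiably correct earns a bonus, and any steps appended afterward are treated as overthinking and penalized. The function name `stepwise_rewards`, the `verify` callback, and the specific reward values are hypothetical, not the paper's actual mechanism.

```python
from typing import Callable, List

def stepwise_rewards(
    steps: List[str],
    verify: Callable[[str], bool],
    bonus: float = 1.0,
    penalty: float = -0.5,
) -> List[float]:
    """Assign a rule-based reward to each intermediate reasoning state.

    A state is the trajectory prefix ending at that step. The step that
    first makes the prefix verifiably correct gets `bonus`; steps after
    correctness is reached are treated as overthinking and get `penalty`;
    all other steps are neutral (0.0).
    """
    rewards: List[float] = []
    solved = False
    for i in range(len(steps)):
        prefix = " ".join(steps[: i + 1])
        if solved:
            rewards.append(penalty)   # redundant step after a correct state
        elif verify(prefix):
            rewards.append(bonus)     # effective step: reaches a correct state
            solved = True
        else:
            rewards.append(0.0)       # not yet correct, no penalty
    return rewards

# Toy verifier: a state counts as correct once it contains the answer "42".
r = stepwise_rewards(
    ["compute 6*7", "the answer is 42", "double-check by long multiplication"],
    verify=lambda s: "42" in s,
)
# r == [0.0, 1.0, -0.5]
```

These per-step rewards could then be fed into PPO or Reinforce++ as dense signals alongside the final-answer reward; in practice the verification rule would need to parse intermediate answers rather than do a substring check.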