Overthinking Reduction with Decoupled Rewards and Curriculum Data Scheduling

📅 2025-09-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large reasoning models suffer from "overthinking" under reinforcement learning with verifiable rewards (RLVR): they generate excessively long reasoning traces without performance gains, while conventional trajectory-level length penalties fail due to misalignment with token-level optimization. To address this, the authors propose DECS, a critic-free framework grounded entirely in verifiable rewards. Its core contributions are: (1) a token-level decoupled reward mechanism that identifies and penalizes redundant reasoning tokens; and (2) a curriculum-based batch scheduling strategy that dynamically balances exploration efficiency and reasoning quality. Evaluated across seven benchmarks, DECS reduces average reasoning tokens by over 50% while preserving or improving task accuracy, jointly optimizing inference efficiency and reasoning capability.

📝 Abstract
While large reasoning models trained with critic-free reinforcement learning and verifiable rewards (RLVR) represent the state-of-the-art, their practical utility is hampered by "overthinking", a critical issue where models generate excessively long reasoning paths without any performance benefit. Existing solutions that penalize length often fail, inducing performance degradation due to a fundamental misalignment between trajectory-level rewards and token-level optimization. In this work, we introduce a novel framework, DECS, built on our theoretical discovery of two previously unaddressed flaws in current length rewards: (1) the erroneous penalization of essential exploratory tokens and (2) the inadvertent rewarding of partial redundancy. Our framework's innovations include (i) a first-of-its-kind decoupled token-level reward mechanism that surgically distinguishes and penalizes redundant tokens, and (ii) a novel curriculum batch scheduling strategy to master the efficiency-efficacy equilibrium. Experimental results show DECS can achieve a dramatic reduction in reasoning tokens by over 50% across seven benchmarks while simultaneously maintaining or even improving performance. It demonstrates conclusively that substantial gains in reasoning efficiency can be achieved without compromising a model's underlying reasoning power.
Problem

Research questions and friction points this paper is trying to address.

Reduces overthinking by penalizing only redundant reasoning tokens
Addresses misalignment between trajectory rewards and token optimization
Maintains model performance while cutting reasoning length by over 50%
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoupled token-level reward penalizes redundant tokens
Curriculum batch scheduling balances efficiency and efficacy
Reduces reasoning tokens by over 50% while maintaining performance
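The paper itself does not publish its reward formula in this summary, but the idea of a decoupled token-level reward can be illustrated with a minimal sketch. Here, `redundant_mask` is a hypothetical per-token flag (assumed to come from some redundancy detector, which DECS would provide); the verifiable outcome reward is shared by all tokens, while a length penalty is applied only to tokens flagged as redundant, so exploratory tokens are never penalized. All names and the penalty form are illustrative assumptions, not the authors' exact method.

```python
def decoupled_token_rewards(tokens, redundant_mask, correct, alpha=0.1):
    """Hypothetical sketch of a decoupled token-level reward.

    Every token receives the verifiable outcome reward (+1 if the final
    answer is correct, -1 otherwise), but the length penalty -alpha is
    subtracted only from tokens flagged redundant. This avoids the
    trajectory-level penalty's two failure modes noted in the abstract:
    punishing essential exploratory tokens, and rewarding partially
    redundant traces just for being short.
    """
    outcome = 1.0 if correct else -1.0
    return [outcome - (alpha if redundant else 0.0)
            for _, redundant in zip(tokens, redundant_mask)]


# Example: a correct 4-token trace where only the third token is redundant.
rewards = decoupled_token_rewards(
    ["step1", "step2", "filler", "answer"],
    [False, False, True, False],
    correct=True,
    alpha=0.1,
)
```

Under this sketch, only the flagged token's reward drops (0.9 vs. 1.0), so gradient pressure to shorten the trace falls exclusively on redundant tokens.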