DRPO: Efficient Reasoning via Decoupled Reward Policy Optimization

📅 2025-10-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large reasoning models (LRMs) trained with reinforcement learning algorithms (e.g., GRPO) exhibit improved complex reasoning but suffer from "overthinking": generating unnecessarily lengthy rationales for simple tasks, which increases computational cost and latency. Existing length-reward methods use global normalization, erroneously assigning negative advantages to correct yet long rationales and thereby degrading performance. Method: Decoupled Reward Policy Optimization (DRPO): (i) decouples the length-reward signal for correct versus incorrect reasoning paths; (ii) normalizes the advantage function only within the positive (correct) sample group; (iii) models the optimal positive-sample distribution under KL regularization; and (iv) exploits on-policy data efficiently via importance-weighted and discriminative objectives. Contribution/Results: On mathematical reasoning benchmarks, DRPO reduces average reasoning length by 77% for a 1.5B-parameter model while incurring only a 1.1% accuracy drop on simple questions (e.g., GSM8k), outperforming six state-of-the-art baselines.

📝 Abstract
Recent large reasoning models (LRMs) driven by reinforcement learning algorithms (e.g., GRPO) have achieved remarkable performance on challenging reasoning tasks. However, these models suffer from overthinking, generating unnecessarily long and redundant reasoning even for simple questions, which substantially increases computational cost and response latency. While existing methods incorporate length rewards into GRPO to promote concise reasoning, they incur significant performance degradation. We identify the root cause: when rewards for correct but long rollouts are penalized, GRPO's group-relative advantage function can assign them negative advantages, actively discouraging valid reasoning. To overcome this, we propose Decoupled Reward Policy Optimization (DRPO), a novel framework that decouples the length-based learning signal of correct rollouts from incorrect ones. DRPO ensures that reward signals for correct rollouts are normalized solely within the positive group, shielding them from interference by negative samples. DRPO's objective integrates an optimized positive data distribution, which maximizes length-based rewards under KL regularization, into a discriminative objective. We derive a closed-form solution for this distribution, enabling efficient computation of the objective and its gradients using only on-policy data and importance weighting. Of independent interest, this formulation is general and can incorporate other preference rewards on positive data beyond length. Experiments on mathematical reasoning tasks demonstrate DRPO's significant superiority over six efficient reasoning baselines. Notably, with a 1.5B model, our method achieves a 77% length reduction with only a 1.1% performance loss on simple questions such as those in the GSM8k dataset, while the runner-up baseline sacrifices 4.3% for a 68% length reduction.
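The failure mode the abstract identifies can be illustrated with a toy advantage computation. This is a minimal sketch, not the paper's exact formulation: the reward shaping, penalty coefficient, and zero placeholder for incorrect rollouts are all illustrative assumptions.

```python
import numpy as np

# Toy rollout group for one prompt: correctness flags and token lengths.
correct = np.array([1, 1, 1, 0])
length = np.array([200, 600, 1200, 100])

# Assumed shaped reward: +1 for a correct answer minus a small length penalty.
reward = correct - 0.001 * length

# GRPO-style group-relative advantage, normalized over ALL rollouts.
adv_global = (reward - reward.mean()) / (reward.std() + 1e-8)

# Pathology: the correct-but-long rollout (index 2) is ranked below the
# incorrect rollout (index 3), so valid reasoning is actively discouraged.

# DRPO-style idea (sketch): normalize only within the positive group, so
# correct rollouts compete among themselves; incorrect rollouts would be
# handled by a separate signal (left at 0 here as a placeholder).
pos = correct == 1
adv_decoupled = np.zeros_like(reward)
adv_decoupled[pos] = (reward[pos] - reward[pos].mean()) / (reward[pos].std() + 1e-8)
```

Within the positive group the shorter correct rollouts still earn higher advantages, which is what drives length reduction, but no correct rollout's signal is dragged down by incorrect samples.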
Problem

Research questions and friction points this paper is trying to address.

Addresses overthinking in large reasoning models
Reduces computational cost of lengthy reasoning
Maintains performance while shortening reasoning steps
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decouples length rewards between correct and incorrect reasoning
Uses positive group normalization to protect valid reasoning
Integrates optimized positive data distribution with KL regularization
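In the standard KL-regularized setting, the optimized distribution has the well-known exponential-tilting closed form q*(y) ∝ π(y)·exp(r(y)/β), so expectations under q* can be estimated from on-policy samples via self-normalized importance weighting. The sketch below assumes that standard form; β and the Gaussian reward stand-in are assumptions, and the paper's exact objective may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

beta = 0.5                    # assumed KL regularization strength
r = rng.normal(size=10_000)   # stand-in rewards for on-policy samples y ~ pi

# Closed form q*(y) ∝ pi(y) * exp(r(y) / beta): expectations under q*
# reduce to self-normalized importance weights over on-policy samples,
# so no sampling from q* itself is ever required.
w = np.exp(r / beta)
w /= w.sum()

# Example: estimate E_{q*}[r] from the same on-policy batch.
expected_reward_under_qstar = np.dot(w, r)
```

Because the weights upweight high-reward samples, the tilted expectation exceeds the plain on-policy mean, reflecting the pull toward the preferred (e.g., shorter) positive rollouts.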
Gang Li (Texas A&M University)
Yan Chen (The University of Virginia)
Ming Lin (Texas A&M University)
Tianbao Yang (Texas A&M University)
machine learning · stochastic optimization