DCPO: Dynamic Clipping Policy Optimization

📅 2025-09-02
🤖 AI Summary
To address the zero-gradient problem in RLVR frameworks, caused by fixed token-level clipping and the standardization of identical rewards in methods like GRPO, this paper proposes dynamic clipping and smoothed advantage standardization. Methodologically, it introduces: (1) a dynamic clipping threshold derived from token-level prior probabilities, reducing clipping frequency by an order of magnitude; and (2) advantage standardization accumulated across training steps combined with response-level reward processing, increasing the proportion of non-zero advantages by 28%. These changes mitigate vanishing gradients, improving both response utilization and exploration. Evaluated on four standard benchmarks, the method achieves state-of-the-art performance, attaining an AIME24 Avg@1 score of 46.7, substantially surpassing prior baselines, and roughly doubles training efficiency without compromising convergence or stability.

📝 Abstract
Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a promising framework for enhancing the reasoning capabilities of large language models. However, existing approaches such as GRPO often suffer from zero gradients. This problem arises primarily due to fixed clipping bounds for token-level probability ratios and the standardization of identical rewards, which can lead to ineffective gradient updates and underutilization of generated responses. In this work, we propose Dynamic Clipping Policy Optimization (DCPO), which introduces a dynamic clipping strategy that adaptively adjusts the clipping bounds based on token-specific prior probabilities to enhance token-level exploration, and a smooth advantage standardization technique that standardizes rewards across cumulative training steps to improve the response-level effective utilization of generated responses. DCPO achieved state-of-the-art performance on four benchmarks based on four different models. In particular, DCPO achieved an Avg@1 of 46.7 under greedy decoding and an Avg@32 of 38.8 under 32 times sampling on the AIME24 benchmark, surpassing both DAPO (36.7/31.6) and GRPO (36.7/32.1) on the Qwen2.5-Math-7B model. On the AIME25 benchmark based on Qwen2.5-14B, DCPO achieves a performance of (23.3/19.0), surpassing GRPO (13.3/10.5) and DAPO (20.0/15.3). Furthermore, DCPO achieved an average 28% improvement in the nonzero advantage over GRPO in four models, doubled the training efficiency over DAPO, and significantly reduced the token clipping ratio by an order of magnitude compared to both GRPO and DAPO, while achieving superior performance. These results highlight DCPO's effectiveness in leveraging generated data more efficiently for reinforcement learning in large language models.
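The abstract describes clipping bounds that adapt to each token's prior probability rather than the fixed PPO-style range used by GRPO. The paper's exact schedule is not given here, so the following is only a hedged sketch under the assumption that rarer tokens receive a wider clipping range (to permit more exploration); the function names and the `eps * max(1 - prior, 0.1)` schedule are illustrative, not the paper's formula.

```python
def dynamic_clip_bounds(prior_prob: float, eps: float = 0.2):
    """Hypothetical per-token clipping bounds.

    Assumption: the allowed ratio range widens as the token's prior
    probability shrinks, so low-probability tokens are clipped less
    often and can be explored more aggressively.
    """
    eps_tok = eps * max(1.0 - prior_prob, 0.1)  # wider bound for rare tokens
    return 1.0 - eps_tok, 1.0 + eps_tok

def clipped_token_objective(ratio: float, advantage: float,
                            prior_prob: float, eps: float = 0.2) -> float:
    """PPO-style clipped surrogate, but with per-token bounds."""
    low, high = dynamic_clip_bounds(prior_prob, eps)
    clipped = min(max(ratio, low), high)
    return min(ratio * advantage, clipped * advantage)
```

A high-prior token (e.g. prior 0.9) gets a narrow range near [0.98, 1.02], while a rare token (prior 0.01) gets roughly [0.80, 1.20], which is one plausible way to realize the clipping-frequency reduction the abstract reports.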
Problem

Research questions and friction points this paper is trying to address.

Addresses zero gradients in RLVR from fixed clipping bounds
Solves ineffective gradient updates due to identical reward standardization
Improves response utilization and token-level exploration in RL
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic clipping strategy adaptively adjusts token bounds
Smooth advantage standardization improves response utilization
Enhances token-level exploration with prior probabilities
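The smooth advantage standardization above addresses the case where all rewards in a group are identical, which zeroes every advantage under per-group standardization. One way to realize "standardizing across cumulative training steps" is a running mean/variance over all rewards seen so far (Welford's algorithm); this is a hedged sketch of that idea, not the paper's exact estimator.

```python
import math

class CumulativeAdvantageNormalizer:
    """Hypothetical running normalizer: standardize rewards against
    statistics accumulated across training steps, so a group of
    identical rewards can still yield non-zero advantages."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations (Welford)

    def update(self, rewards):
        for r in rewards:
            self.count += 1
            delta = r - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (r - self.mean)

    def advantages(self, rewards):
        # Fall back to unit scale when variance is zero.
        std = math.sqrt(self.m2 / max(self.count - 1, 1)) or 1.0
        return [(r - self.mean) / std for r in rewards]
```

With cumulative statistics from earlier steps (say rewards 0 and 1 seen before), a group where every response scores 1.0 still produces positive advantages, instead of the all-zero advantages per-group standardization would give.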
Shihui Yang
Baichuan.inc
Chengfeng Dou
Baichuan.inc
Peidong Guo
Baichuan.inc
Kai Lu
Baichuan.inc
Qiang Ju
Baichuan.inc
Fei Deng
Research Scientist, Google
Diffusion Models · RLHF · Reinforcement Learning · Generative Models · Object-Centric Learning
Rihui Xin
Baichuan.inc