🤖 AI Summary
This work addresses two coupled challenges in reinforcement learning for LLMs: hard clipping stifles exploration by truncating gradients, while soft clipping methods rely on log-probability gradients whose weights diverge as token probabilities vanish, destabilizing training. To overcome this, we propose Decoupled Gradient Policy Optimization (DGPO), the first approach to replace log-probability gradients with probability gradients. DGPO introduces a novel bilateral decoupled decay mechanism based on importance sampling ratios, applying asymmetric, continuous decay specifically to boundary tokens. This design mitigates gradient divergence while preserving exploration capability. Extensive experiments demonstrate that DGPO consistently outperforms existing baselines across the DeepSeek-R1-Distill-Qwen model series (1.5B/7B/14B) and achieves stable, scalable performance gains on multiple mathematical reasoning benchmarks.
📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has catalyzed a leap in Large Language Model (LLM) reasoning, yet its optimization dynamics remain fragile. Standard algorithms like GRPO enforce stability via "hard clipping", which inadvertently stifles exploration by discarding the gradients of tokens outside the trust region. While recent "soft clipping" methods attempt to recover these gradients, they face a critical challenge: relying on the log-probability gradient ($\nabla_\theta \log \pi_\theta$) yields divergent weights as probabilities vanish, destabilizing LLM training. We rethink this convention by establishing the probability gradient ($\nabla_\theta \pi_\theta$) as the superior optimization primitive. Accordingly, we propose Decoupled Gradient Policy Optimization (DGPO), which employs a decoupled decay mechanism based on importance sampling ratios. By applying asymmetric, continuous decay to boundary tokens, DGPO resolves the conflict between stability and sustained exploration. Extensive experiments across the DeepSeek-R1-Distill-Qwen model series (1.5B/7B/14B) demonstrate that DGPO consistently outperforms strong baselines on various mathematical benchmarks, offering a robust and scalable solution for RLVR. Our code and implementation are available at: https://github.com/VenomRose-Juri/DGPO-RL.
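To make the "asymmetric, continuous decay on boundary tokens" concrete, here is a minimal sketch of one plausible form of such a mechanism. The function names, the exponential decay shape, and all constants (`eps_low`, `eps_high`, `k_low`, `k_high`) are illustrative assumptions, not the paper's exact formulation; the point is that the weight stays 1 inside the trust region, decays continuously (rather than dropping to zero as in hard clipping), and can decay at different rates on the two sides of the ratio interval.

```python
import math

def bilateral_decay(ratio, eps_low=0.2, eps_high=0.2, k_low=5.0, k_high=10.0):
    """Continuous, asymmetric decay weight for a token's importance ratio.

    Inside [1 - eps_low, 1 + eps_high] the weight is 1 (full gradient kept).
    Outside, the weight decays exponentially with distance from the boundary,
    faster on the high side here (k_high > k_low). All constants are
    hypothetical choices for illustration.
    """
    lo, hi = 1.0 - eps_low, 1.0 + eps_high
    if ratio < lo:
        return math.exp(-k_low * (lo - ratio))
    if ratio > hi:
        return math.exp(-k_high * (ratio - hi))
    return 1.0

def token_surrogate(prob_new, prob_old, advantage):
    """Toy per-token surrogate: advantage * (pi_theta / pi_old) * decay.

    Because the ratio pi_theta / pi_old is linear in pi_theta, its gradient
    is grad(pi_theta) / pi_old -- a probability gradient, with no 1/pi_theta
    factor that could blow up as pi_theta -> 0 (the failure mode the abstract
    attributes to log-probability-gradient soft clipping).
    """
    ratio = prob_new / prob_old
    return bilateral_decay(ratio) * ratio * advantage
```

Unlike hard clipping, the weight is continuous at the region boundaries (it equals 1 at `1 - eps_low` and `1 + eps_high`), so boundary tokens are tempered rather than abruptly silenced.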