🤖 AI Summary
Policy gradient methods like PPO suffer from vanishing gradients for low-probability tokens: once a token's probability ratio falls outside the clipping interval, its gradient is discarded, which drives entropy collapse and impairs the exploration-exploitation trade-off.
Method: We propose Controlling Entropy via Gradient-Preserving Policy Optimization (CE-GPPO), which (i) identifies the critical role of clipped tokens in entropy dynamics, (ii) introduces a bounded gradient backpropagation strategy that preserves gradient signals from tokens outside the clipping interval, and (iii) integrates theoretical analysis with dynamic entropy regulation, all without altering PPO's core update logic.
Results: CE-GPPO demonstrates strong compatibility and consistently outperforms PPO and its variants across multiple mathematical reasoning benchmarks. It effectively mitigates entropy collapse, enhances training stability for models of varying scales, and improves final task performance.
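The gradient-vanishing effect described above can be made concrete with the analytic per-token gradient of PPO's clipped surrogate. This is a minimal single-token sketch; the function name and the `eps` default are illustrative, not from the paper:

```python
import math

def ppo_token_grad(logp, logp_old, advantage, eps=0.2):
    """Analytic d(loss)/d(logp) for one token under PPO's clipped
    surrogate: loss = -min(r*A, clip(r, 1-eps, 1+eps)*A),
    with ratio r = exp(logp - logp_old) and advantage A.

    Returns 0.0 whenever the clipped branch is selected, i.e. the
    token's gradient signal is discarded.
    """
    r = math.exp(logp - logp_old)
    # The clipped branch is selected by the min() exactly when the ratio
    # leaves [1-eps, 1+eps] in the direction the advantage pushes it.
    if (advantage > 0 and r > 1 + eps) or (advantage < 0 and r < 1 - eps):
        # The clip boundary is a constant w.r.t. logp, so the gradient vanishes.
        return 0.0
    # Unclipped branch: d(-r*A)/d(logp) = -A * r.
    return -advantage * r

# A low-probability token (ratio r ≈ 0.37 < 1 - eps) with negative
# advantage is clipped, so its gradient is silently zeroed:
print(ppo_token_grad(logp=-2.0, logp_old=-1.0, advantage=-1.0))  # 0.0
```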
📝 Abstract
Reinforcement learning (RL) has become a powerful paradigm for optimizing large language models (LLMs) to handle complex reasoning tasks. A core challenge in this process lies in managing policy entropy, which reflects the balance between exploration and exploitation during training. Existing methods, such as proximal policy optimization (PPO) and its variants, discard valuable gradient signals from low-probability tokens due to the clipping mechanism. We systematically analyze the entropy dynamics and reveal that these clipped tokens play a critical yet overlooked role in regulating entropy evolution. We propose **C**ontrolling **E**ntropy via **G**radient-**P**reserving **P**olicy **O**ptimization (CE-GPPO), a novel algorithm that reintroduces gradients from clipped tokens in native PPO in a gentle and bounded manner. By controlling the magnitude of gradients from tokens outside the clipping interval, CE-GPPO is able to achieve an exploration-exploitation trade-off. We provide theoretical justification and empirical evidence showing that CE-GPPO effectively mitigates entropy instability. Extensive experiments on mathematical reasoning benchmarks show that CE-GPPO consistently outperforms strong baselines across different model scales.
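One way to read "reintroduces gradients from clipped tokens in a gentle and bounded manner" is that a clipped token keeps a small, capped gradient instead of a zero one. The abstract does not specify the exact scheme, so the sketch below is only an illustration under assumptions: the scaling coefficient `beta` and the choice to evaluate the re-introduced gradient at the clip boundary are both hypothetical:

```python
import math

def gradient_preserving_token_grad(logp, logp_old, advantage,
                                   eps=0.2, beta=0.1):
    """Illustrative per-token gradient for a gradient-preserving PPO
    variant. Standard PPO zeroes d(loss)/d(logp) once the probability
    ratio r = exp(logp - logp_old) leaves [1-eps, 1+eps]; here a clipped
    token instead keeps a small bounded gradient, scaled by a
    hypothetical coefficient `beta` (not from the paper).
    """
    r = math.exp(logp - logp_old)
    if (advantage > 0 and r > 1 + eps) or (advantage < 0 and r < 1 - eps):
        # Re-introduced gradient, evaluated at the clip boundary so its
        # magnitude stays bounded by beta * |A| * (1 + eps) regardless
        # of how far the ratio has drifted.
        boundary = 1 + eps if r > 1 + eps else 1 - eps
        return -beta * advantage * boundary
    # Inside the clipping interval: identical to vanilla PPO.
    return -advantage * r
```

The key contrast with vanilla PPO is that the clipped low-probability token above (`r ≈ 0.37` with negative advantage) still receives a small, sign-correct update rather than none at all, which is the lever the abstract describes for steering entropy.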