Clip-Low Increases Entropy and Clip-High Decreases Entropy in Reinforcement Learning of Large Language Models

📅 2025-09-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper identifies a novel cause of entropy collapse in verifiable reward reinforcement learning (RLVR): the clipping mechanisms in PPO and GRPO inherently induce entropy bias—clip-low increases entropy and encourages exploration, whereas clip-high suppresses entropy and accelerates convergence. Under standard hyperparameters, clip-high dominates, leading to persistent entropy decay—even under random rewards—introducing a reward-agnostic confounding factor. Method: To counteract premature convergence, the authors propose actively amplifying clip-low to regulate policy entropy. Contribution/Results: Theoretical analysis and empirical evaluation demonstrate that this intervention substantially mitigates entropy collapse, improves long-horizon reasoning stability, and enhances generalization. It offers a principled, actionable mechanism for balancing exploration and exploitation in large language model (LLM) reinforcement learning—advancing both interpretability and controllability of policy optimization dynamics.

Technology Category

Application Category

📝 Abstract
Reinforcement learning with verifiable rewards (RLVR) has recently emerged as the leading approach for enhancing the reasoning capabilities of large language models (LLMs). However, RLVR is prone to entropy collapse, where the LLM quickly converges to a near-deterministic form, hindering exploration and progress during prolonged RL training. In this work, we reveal that the clipping mechanism in PPO and GRPO induces biases on entropy. Through theoretical and empirical analyses, we show that clip-low increases entropy, while clip-high decreases it. Further, under standard clipping parameters, the effect of clip-high dominates, resulting in an overall entropy reduction even when purely random rewards are provided to the RL algorithm. Our findings highlight an overlooked confounding factor in RLVR: independent of the reward signal, the clipping mechanism influences entropy, which in turn affects the reasoning behavior. Furthermore, our analysis demonstrates that clipping can be deliberately used to control entropy. Specifically, with a more aggressive clip-low value, one can increase entropy, promote exploration, and ultimately prevent entropy collapse in RLVR training.
Problem

Research questions and friction points this paper is trying to address.

Clipping mechanisms in RL algorithms bias entropy dynamics
Standard clipping parameters cause overall entropy reduction in training
Entropy collapse hinders exploration during prolonged RL training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Clip-low increases entropy to promote exploration
Clip-high decreases entropy reducing model diversity
Adjusting clipping parameters controls entropy in RLVR
🔎 Similar Papers
No similar papers found.
Jaesung R. Park
Jaesung R. Park
Department of Mathematics, UCLA
J
Junsu Kim
Department of Mathematical Sciences, Seoul National University
G
Gyeongman Kim
KRAFTON
J
Jinyoung Jo
Department of Linguistics, Stanford University
S
Sean Choi
Department of Computer Science and Engineering, Santa Clara University
Jaewoong Cho
Jaewoong Cho
KRAFTON AI
Machine LearningInformation Theory
Ernest K. Ryu
Ernest K. Ryu
University of California, Los Angeles
Deep learning theoryConvex optimization