The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

📅 2025-05-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the prevalent early entropy collapse problem in reinforcement learning (RL) for large language models (LLMs): a rapid decline in policy entropy $H$ that impairs exploration and leads to performance saturation. The authors first establish a quantitative empirical relation between policy entropy and downstream performance, $R = -a e^H + b$, which makes the performance ceiling predictable: $R = -a + b$ at $H = 0$. They further show that entropy decay is driven by the covariance between a token's action probability and the change in its logit, which is proportional to its advantage under policy-gradient-like updates. To counteract this, they propose two covariance-aware entropy controls: Clip-Cov, which clips the updates of high-covariance tokens, and KL-Cov, which applies a KL penalty to them. Experiments demonstrate that these methods suppress entropy collapse, sustain exploration for longer, and achieve better downstream performance on reasoning tasks. The work positions entropy management as a key ingredient for scaling and stabilizing RL training of LLMs.
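Written out, the two quantitative relations referenced above are (a schematic restatement of the claims in the summary and abstract; step-size constants and expectation details follow the paper's derivation):

$$
R = -a\,e^{H} + b, \qquad R\big|_{H=0} = -a + b,
$$

$$
H(\pi_{k+1}) - H(\pi_k) \;\approx\; -\operatorname{Cov}_{a \sim \pi_k(\cdot \mid s)}\!\big(\log \pi_k(a \mid s),\; \Delta z_k(a \mid s)\big),
$$

where $\Delta z_k$ is the step-$k$ change in the action's logit, proportional to its advantage under policy-gradient-like updates. Since this covariance stays mostly positive in practice, entropy decreases monotonically.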

📝 Abstract
This paper aims to overcome a major obstacle in scaling RL for reasoning with LLMs, namely the collapse of policy entropy. This phenomenon is consistently observed across a large number of RL runs without entropy intervention: policy entropy drops sharply in the early training stage, and this diminished exploratory ability is always accompanied by the saturation of policy performance. In practice, we establish a transformation equation $R = -a e^H + b$ between entropy $H$ and downstream performance $R$. This empirical law strongly indicates that policy performance is traded for policy entropy and is therefore bottlenecked by its exhaustion, with a fully predictable ceiling: at $H = 0$, $R = -a + b$. Our finding necessitates entropy management for continuous exploration when scaling compute for RL. To this end, we investigate entropy dynamics both theoretically and empirically. Our derivation highlights that the change in policy entropy is driven by the covariance between action probability and the change in logits, which is proportional to the advantage when using policy-gradient-like algorithms. Empirical study shows that the values of the covariance term and the entropy differences match exactly, supporting the theoretical conclusion. Moreover, the covariance term stays mostly positive throughout training, further explaining why policy entropy decreases monotonically. Guided by this mechanism, we control entropy by restricting the update of high-covariance tokens. Specifically, we propose two simple yet effective techniques, Clip-Cov and KL-Cov, which respectively clip and apply a KL penalty to tokens with high covariance. Experiments show that these methods encourage exploration, helping the policy escape entropy collapse and achieve better downstream performance.
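Since the law $R = -a e^H + b$ is linear in its two parameters, fitting it and reading off the $H = 0$ ceiling reduces to a small least-squares problem. A minimal illustrative sketch follows; the data points are made up for demonstration, and in practice the $(H, R)$ pairs would come from logged validation checkpoints during training:

```python
import numpy as np
from scipy.optimize import curve_fit

# Empirical law from the paper: R = -a * exp(H) + b
def entropy_performance(H, a, b):
    return -a * np.exp(H) + b

# Illustrative (entropy, reward) pairs; real values would be logged
# at validation checkpoints over the course of an RL run.
H_obs = np.array([1.20, 0.80, 0.50, 0.30, 0.15, 0.08])
R_obs = np.array([0.27, 0.38, 0.44, 0.46, 0.48, 0.49])

(a, b), _ = curve_fit(entropy_performance, H_obs, R_obs)

# Predicted performance ceiling once entropy is exhausted (H = 0): R = -a + b
print(f"a={a:.3f}, b={b:.3f}, predicted ceiling R(H=0) = {-a + b:.3f}")
```

With a fit like this, the ceiling $-a + b$ can be estimated early in training, before entropy is actually exhausted.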
Problem

Research questions and friction points this paper is trying to address.

Prevent policy entropy collapse in RL for reasoning LLMs
Establish the entropy-performance trade-off equation $R = -a e^H + b$
Control entropy by restricting updates of high-covariance tokens
Innovation

Methods, ideas, or system contributions that make the work stand out.

Establishes the entropy-performance equation $R = -a e^H + b$
Proposes the Clip-Cov and KL-Cov entropy-control techniques (sketched below)
Controls entropy by restricting updates of high-covariance tokens
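A minimal PyTorch-style sketch of the two techniques as described in the abstract. The per-token covariance surrogate, the top-fraction selection rule, `frac`, `beta`, and the squared log-ratio (k2) stand-in for a per-token KL are illustrative assumptions, not the paper's exact recipe; `logp` holds current-policy log-probs of the sampled tokens (flattened to 1-D) and `adv` is assumed detached:

```python
import torch

def token_covariance(logp: torch.Tensor, adv: torch.Tensor) -> torch.Tensor:
    # Per-token surrogate for Cov(log-prob, advantage) over the batch:
    # the centered cross term, one value per token.
    return (logp - logp.mean()) * (adv - adv.mean())

def high_cov_mask(logp, adv, frac):
    # Mark the top `frac` fraction of tokens ranked by covariance.
    cov = token_covariance(logp.detach(), adv)
    k = max(1, int(frac * logp.numel()))
    mask = torch.zeros_like(logp)
    mask[torch.topk(cov, k).indices] = 1.0
    return mask

def clip_cov_loss(logp, adv, frac=0.002):
    # Clip-Cov (sketch): exclude the highest-covariance tokens from the
    # policy-gradient update, so the tokens most responsible for entropy
    # decay stop being reinforced.
    keep = 1.0 - high_cov_mask(logp, adv, frac)
    return (-(logp * adv) * keep).mean()  # REINFORCE-style surrogate

def kl_cov_loss(logp, logp_old, adv, frac=0.002, beta=1.0):
    # KL-Cov (sketch): update every token, but penalize the highest-
    # covariance ones for drifting from the rollout policy; the squared
    # log-ratio (k2 estimator) stands in for a per-token KL term.
    picked = high_cov_mask(logp, adv, frac)
    kl_hat = 0.5 * (logp - logp_old.detach()) ** 2
    return (-(logp * adv) + beta * picked * kl_hat).mean()
```

Both controls act on the same diagnostic: tokens whose probability co-varies strongly with their advantage are the ones driving entropy down, so damping exactly those updates preserves exploration at little cost to the objective.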
👥 Authors

Ganqu Cui
Shanghai AI Lab
LLM Alignment, Reinforcement Learning

Yuchen Zhang
Shanghai AI Laboratory, Peking University

Jiacheng Chen
Shanghai AI Laboratory

Lifan Yuan
University of Illinois Urbana-Champaign
Natural Language Processing, Machine Learning

Zhi Wang
Nanjing University

Yuxin Zuo
Tsinghua University

Haozhan Li
Tsinghua University
LLM RL, VLA RL

Yuchen Fan
Shanghai AI Laboratory & Shanghai Jiao Tong University
NLP, Large Language Models, Evaluation

Huayu Chen
Tsinghua University
Reinforcement Learning, Deep Generative Models, Machine Learning

Weize Chen
Tsinghua University
NLP, ML

Zhiyuan Liu
Tsinghua University

Hao Peng
UIUC

Lei Bai
Shanghai AI Laboratory
Foundation Model, Science Intelligence, Multi-Agent System, Autonomous Discovery

Wanli Ouyang
Shanghai AI Laboratory

Yu Cheng
Shanghai AI Laboratory, CUHK

Bowen Zhou
Shanghai AI Laboratory, Tsinghua University

Ning Ding
Tsinghua University, Shanghai AI Laboratory