🤖 AI Summary
In reinforcement learning with verifiable rewards (RLVR), large language models suffer from entropy collapse—a sharp decline in policy diversity—leading to an exploration-exploitation imbalance and degraded generalization. Existing entropy regularization methods operate through opaque mechanisms, indirectly modulating advantages or token probabilities, which limits their efficacy and makes them prone to failure. This work is the first to quantitatively characterize the root cause of entropy collapse through token-level entropy dynamics. We propose STEER, a reweighting framework that directly stabilizes entropy evolution: it employs entropy-change-aware, fine-grained loss reweighting and gradient adjustment to adaptively balance exploration and exploitation during training. Experiments demonstrate that STEER significantly mitigates entropy collapse, improves performance on downstream tasks—including mathematical reasoning—and enhances training stability, thereby validating the effectiveness of directly regulating entropy dynamics.
📝 Abstract
While Reinforcement Learning with Verifiable Rewards (RLVR) can enhance LLM reasoning, its training process poses a critical risk: entropy collapse. This phenomenon is a rapid loss of policy diversity, stemming from the exploration-exploitation imbalance and leading to a lack of generalization. Recent entropy-intervention methods aim to prevent entropy collapse, yet their underlying mechanisms remain unclear. In this paper, we conduct a quantitative analysis to reveal token-level entropy changes and how existing entropy intervention methods help avoid entropy collapse. Our findings point to a fundamental limitation of existing methods: they attempt to control entropy dynamics indirectly. By only affecting related factors, such as the advantage signal and generation probability, their effectiveness is inherently limited and can fail. To address this limitation, we introduce an entropy-change-aware reweighting scheme, namely Stabilizing Token-level Entropy-changE via Reweighting (STEER), that adaptively stabilizes entropy dynamics through fine-grained token-level adjustments. Our approach mitigates over-exploitation while fostering robust exploration. Extensive experiments demonstrate that STEER significantly mitigates entropy collapse, stabilizes entropy dynamics, and achieves stronger downstream performance across various mathematical reasoning benchmarks. Our code is available at https://github.com/zz-haooo/STEER.
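To make the core idea concrete, here is a minimal sketch of entropy-change-aware token-level reweighting. This is an illustration under our own assumptions, not the paper's implementation: the functions `token_entropy`, `entropy_aware_weights`, and `reweighted_pg_loss`, along with the sigmoid weighting rule and the temperature `tau`, are hypothetical stand-ins for whatever scheme STEER actually uses. The sketch only shows the general shape: estimate the per-token entropy change between the old and updated policy, then scale each token's policy-gradient loss so that tokens driving a rapid entropy drop (over-exploitation) are down-weighted.

```python
import numpy as np

def token_entropy(logits):
    # Per-token entropy of the softmax policy over the vocabulary.
    # logits: array of shape (num_tokens, vocab_size).
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def entropy_aware_weights(logits_old, logits_new, tau=1.0):
    # Hypothetical reweighting rule: map the per-token entropy change
    # through a bounded sigmoid, so tokens whose entropy is falling
    # (delta_h < 0, over-exploitation) get weights below 0.5 and tokens
    # whose entropy is rising (exploration) get weights above 0.5.
    delta_h = token_entropy(logits_new) - token_entropy(logits_old)
    return 1.0 / (1.0 + np.exp(-delta_h / tau))  # values in (0, 1)

def reweighted_pg_loss(logp_tokens, advantages, weights):
    # Token-level policy-gradient loss with entropy-change-aware weights
    # applied multiplicatively to each token's contribution.
    return -(weights * advantages * logp_tokens).mean()
```

The key design point the sketch reflects is that the intervention acts directly on each token's loss contribution as a function of its entropy dynamics, rather than indirectly through advantage shaping or probability clipping.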