AI Summary
This work addresses the entropy collapse phenomenon in Reinforcement Learning with Verifiable Rewards (RLVR), where prolonged training often leads to rapid entropy decay, overconfidence, and vanishing gradients, thereby impeding effective learning. For the first time, the study establishes a theoretical link between gradient-preserving clipping and precise entropy control. By analyzing how different regions of the importance sampling ratio contribute to entropy dynamics, the authors propose an entropy regulation mechanism based on dynamic clipping thresholds and design several time-varying entropy evolution strategies, such as increase-then-decrease and oscillatory decay. This approach overcomes the limitations of static entropy scheduling, significantly mitigating entropy collapse across multiple benchmark tasks, enhancing performance, and preserving higher output diversity and training stability.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a critical method for enhancing the reasoning capabilities of Large Language Models (LLMs). However, continuous training often leads to policy entropy collapse, characterized by a rapid decay in entropy that results in premature overconfidence, reduced output diversity, and vanishing gradient norms that inhibit learning. Gradient-Preserving Clipping is a primary factor influencing these dynamics, but existing mitigation strategies are largely static and lack a framework connecting clipping mechanisms to precise entropy control. This paper proposes reshaping entropy control in RL from the perspective of Gradient-Preserving Clipping. We first theoretically and empirically verify the contributions of specific importance sampling ratio regions to entropy growth and reduction. Leveraging these findings, we introduce a novel regulation mechanism using dynamic clipping thresholds to precisely manage entropy. Furthermore, we design and evaluate dynamic entropy control strategies, including increase-then-decrease, decrease-increase-decrease, and oscillatory decay. Experimental results demonstrate that these strategies effectively mitigate entropy collapse and achieve superior performance across multiple benchmarks.
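The abstract names time-varying clipping-threshold schedules (increase-then-decrease, oscillatory decay) as the entropy control knob. The paper's exact schedules and parameter values are not given here, so the sketch below is only an illustration of what such schedules could look like; all function names, functional forms, and constants are assumptions, not the authors' implementation.

```python
import math

def increase_then_decrease(step, total_steps, eps_min=0.1, eps_max=0.3):
    """Hypothetical clipping-threshold schedule: epsilon rises from
    eps_min to eps_max at mid-training, then falls back, letting entropy
    grow early and contract late. All constants are illustrative."""
    frac = step / total_steps
    return eps_min + (eps_max - eps_min) * math.sin(math.pi * frac)

def oscillatory_decay(step, total_steps, eps0=0.3, period=200, damp=3.0):
    """Hypothetical oscillatory-decay schedule: a damped cosine around
    an exponentially decaying baseline threshold."""
    frac = step / total_steps
    base = eps0 * math.exp(-damp * frac)
    return base * (1.0 + 0.5 * math.cos(2.0 * math.pi * step / period))

def clipped_surrogate(ratio, advantage, eps):
    """Standard PPO-style clipped surrogate, shown only to indicate where
    a dynamic eps would plug in; the paper's gradient-preserving variant
    keeps gradients in the clipped regions, which this plain version does not."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```

At each update one would evaluate the schedule at the current step and pass the resulting threshold into the objective, e.g. `clipped_surrogate(r, A, increase_then_decrease(t, T))`, in place of a fixed epsilon.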