🤖 AI Summary
This work addresses the failure of standard entropy regularization in reinforcement learning for large language models, where it often yields ineffective exploration or even degrades performance due to cumulative tail risk. To mitigate this, the authors propose Trust Region Entropy (TRE), which, for the first time, integrates trust region constraints into entropy regularization. By maximizing entropy only within a local neighborhood of the current policy, TRE concentrates exploration on high-confidence candidate tokens, improving both exploration efficiency and reasoning coherence. Implemented within the PPO framework, TRE demonstrates significant improvements over standard PPO, conventional entropy regularization, and other exploration baselines across benchmarks including MATH, Countdown, and HH.
📝 Abstract
Entropy regularization is a standard technique in reinforcement learning (RL) to enhance exploration, yet it yields negligible effects or even degrades performance in Large Language Models (LLMs). We attribute this failure to the cumulative tail risk inherent to LLMs with massive vocabularies and long generation horizons. In such environments, standard global entropy maximization indiscriminately dilutes probability mass into the vast tail of invalid tokens rather than focusing on plausible candidates, thereby disrupting coherent reasoning. To address this, we propose Trust Region Entropy (TRE), a method that encourages exploration strictly within the model's trust region. Extensive experiments across mathematical reasoning (MATH), combinatorial search (Countdown), and preference alignment (HH) tasks demonstrate that TRE consistently outperforms vanilla PPO, standard entropy regularization, and other exploration baselines. Our code is available at https://github.com/WhyChaos/TRE-Encouraging-Exploration-in-the-Trust-Region.
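The core idea — maximizing entropy over a restricted "trust region" of high-confidence tokens instead of the full vocabulary — can be illustrated with a minimal sketch. This is not the paper's implementation: the trust-region criterion below (tokens whose probability is at least a fraction `tau` of the top token's probability) and the function name are illustrative assumptions; the paper's exact formulation of the trust region may differ.

```python
import numpy as np

def trust_region_entropy(logits, tau=0.1):
    """Entropy restricted to a 'trust region' of high-confidence tokens.

    Illustrative criterion (an assumption, not the paper's definition):
    tokens whose probability is at least `tau` times the top token's
    probability form the trust region. Entropy is computed over the
    renormalized distribution on that set, so an entropy bonus never
    pushes probability mass into the vast low-probability tail.
    """
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    mask = probs >= tau * probs.max()      # trust-region membership
    p = probs[mask] / probs[mask].sum()    # renormalize inside the region
    return float(-(p * np.log(p)).sum())

# Peaked logits: only the few plausible tokens enter the trust region,
# so the bonus encourages spreading mass among them, not into the tail.
logits = np.array([5.0, 4.5, 4.0, -3.0, -8.0, -9.0])
h_tre = trust_region_entropy(logits, tau=0.1)

# Global (vocabulary-wide) entropy, for comparison: it also rewards
# mass leaked to implausible tail tokens.
full = np.exp(logits - logits.max())
full /= full.sum()
h_full = float(-(full * np.log(full)).sum())
```

In a PPO-style setup, `h_tre` would replace the usual global-entropy bonus in the objective, rewarding diversity only among tokens the model already deems plausible.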