🤖 AI Summary
In large language model reinforcement learning (LLM-RL), conventional entropy regularization fails due to the enormous output space and the sparsity of optimal responses. To address this, we propose Adaptive Entropy Tuning (AEnt), a novel entropy control framework. AEnt introduces three key components: (1) policy re-normalization over a dynamically identified subset of tokens, which narrows the exploration space; (2) a clamped entropy bonus computed on this re-normalized policy, excluding low-probability redundant tokens to reduce estimation noise; and (3) automatic adjustment of the entropy coefficient according to the clamped entropy value, balancing exploration against entropy-induced bias. Evaluated on multiple mathematical reasoning benchmarks with mainstream LLMs, AEnt consistently improves policy stability and final task performance, outperforming established baselines (including PPO and RLOO) across diverse settings. Our approach establishes a scalable, robust paradigm for entropy regulation in LLM-RL, addressing fundamental limitations of static entropy penalties in high-dimensional discrete action spaces.
📝 Abstract
Appropriate entropy control is crucial to the effectiveness of RL algorithms. A commonly used method for controlling policy entropy is entropy regularization, which is adopted in various popular RL algorithms including PPO, SAC, and A3C. Although entropy regularization has proven effective in conventional robotics and game RL, studies have found that it yields weak to no gains in LLM-RL training. In this work, we study the issues of the entropy bonus in the LLM-RL setting. Specifically, we first argue that conventional entropy regularization suffers from the LLM's extremely large response space and the sparsity of the optimal outputs. As a remedy, we propose AEnt, an entropy control method that utilizes a new clamped entropy bonus with an automatically adjusted coefficient. The clamped entropy is evaluated with the re-normalized policy defined on a smaller token space, which encourages exploration within a more compact response set. In addition, the algorithm automatically adjusts the entropy coefficient according to the clamped entropy value, effectively controlling the entropy-induced bias while leveraging the entropy's benefits. AEnt is tested on math-reasoning tasks under different base models and datasets, and it is observed that AEnt outperforms the baselines consistently across multiple benchmarks.
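To make the two ideas concrete, here is a minimal NumPy sketch of a clamped entropy computed on a re-normalized top-k token distribution, plus a simple coefficient update driven by the measured entropy. The top-k truncation rule, the `target` entropy, and the additive update are illustrative assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np

def clamped_entropy(logits, k=20):
    """Entropy of the policy re-normalized over its k most probable tokens.

    Tokens outside the top-k (typically low-probability, redundant ones)
    are excluded, and the remaining mass is re-normalized to sum to 1.
    """
    # Softmax over the full vocabulary (stabilized by subtracting the max).
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Keep the k largest probabilities and re-normalize them.
    top = np.sort(probs)[::-1][:k]
    top /= top.sum()
    return -np.sum(top * np.log(top))

def adjust_coefficient(coef, entropy, target, lr=0.01):
    """Illustrative rule: raise the entropy coefficient when the clamped
    entropy falls below a target value, lower it when entropy is above.
    """
    return max(0.0, coef + lr * (target - entropy))
```

For a uniform distribution, the clamped entropy over the top k tokens equals log(k); in training, one would feed per-step logits from the policy and fold `coef * clamped_entropy` into the RL objective.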