EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
In sparse-reward, multi-turn environments, LLM-based agents suffer from an "exploration-exploitation cascade failure" in reinforcement learning: early policies prematurely converge to flawed, low-entropy strategies, and later entropy regularization paradoxically triggers policy collapse. Method: The paper is the first to systematically identify this cascade failure mechanism and proposes an entropy-regularized policy optimization framework to break it. The approach introduces (i) an entropy-smoothing regularizer that constrains policy entropy around its historical mean and (ii) an adaptive, phase-wise weighting scheme; together they ensure monotonically decreasing entropy variance and stable policy convergence. Contribution/Results: The framework combines theoretical rigor with engineering practicality. Evaluated on the ScienceWorld and ALFWorld benchmarks, it improves task success rate by up to 152% and 19.8%, respectively, significantly surpassing existing state-of-the-art methods.

📝 Abstract
Training LLM agents in multi-turn environments with sparse rewards, where completing a single task requires 30+ turns of interaction within an episode, presents a fundamental challenge for reinforcement learning. We identify a critical failure mode unique to this setting: the exploration-exploitation cascade failure. This cascade begins with early-stage policy premature convergence, where sparse feedback causes agents to commit to flawed, low-entropy strategies. Subsequently, agents enter late-stage policy collapse, where conventional entropy regularization becomes counterproductive, promoting chaotic exploration that destabilizes training. We propose Entropy-regularized Policy Optimization (EPO), a general framework that breaks this failure cycle through three synergistic mechanisms: (1) adopting entropy regularization in multi-turn settings to enhance exploration, (2) an entropy smoothing regularizer that bounds policy entropy within historical averages to prevent abrupt fluctuations, and (3) adaptive phase-based weighting that balances exploration and exploitation across training. Our analysis justifies that EPO guarantees monotonically decreasing entropy variance while maintaining convergence. EPO achieves up to 152% performance improvement on ScienceWorld and up to 19.8% on ALFWorld. Our work demonstrates that multi-turn sparse-reward settings require fundamentally different entropy control than traditional RL, with broad implications for LLM agent training.
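The abstract's second mechanism, an entropy smoothing regularizer that bounds policy entropy within historical averages, can be illustrated with a minimal sketch. The window size, quadratic penalty form, and class/parameter names here are assumptions for illustration, not the paper's exact formulation.

```python
from collections import deque


class EntropySmoother:
    """Hypothetical sketch of an entropy-smoothing penalty: it discourages
    the current policy entropy from drifting far from its recent historical
    average, damping the abrupt fluctuations the abstract describes."""

    def __init__(self, window: int = 100, coeff: float = 0.01):
        self.history = deque(maxlen=window)  # recent per-step entropy values
        self.coeff = coeff                   # assumed penalty weight

    def penalty(self, entropy: float) -> float:
        """Return a regularization term to add to the RL loss."""
        if not self.history:
            self.history.append(entropy)
            return 0.0
        hist_mean = sum(self.history) / len(self.history)
        self.history.append(entropy)
        # quadratic penalty for deviating from the historical mean entropy
        return self.coeff * (entropy - hist_mean) ** 2
```

In a training loop, this penalty would be computed from the policy's per-step entropy and added to the policy-gradient loss, so both entropy spikes (chaotic exploration) and entropy crashes (premature convergence) are penalized symmetrically.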
Problem

Research questions and friction points this paper is trying to address.

Addresses premature policy convergence under sparse rewards
Prevents late-stage policy collapse triggered by conventional entropy regularization
Enables stable multi-turn reinforcement learning for LLM agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

Entropy regularization adapted to multi-turn settings
Entropy smoothing regularizer bounds policy entropy near historical averages
Adaptive phase-based weighting balances exploration and exploitation
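The adaptive phase-based weighting above can be sketched as a schedule that keeps the entropy weight high early in training (exploration) and decays it toward a small floor late in training (exploitation). The linear decay form and the parameter names/values are illustrative assumptions, not the paper's published scheme.

```python
def phase_weight(step: int, total_steps: int,
                 w_max: float = 0.05, w_min: float = 0.001) -> float:
    """Hypothetical phase-based entropy weight: starts at w_max to
    encourage exploration, decays linearly to w_min to favor
    exploitation as training progresses."""
    frac = min(step / total_steps, 1.0)  # training progress in [0, 1]
    return w_max + frac * (w_min - w_max)
```

A training loop would multiply the entropy bonus (or smoothing penalty) by this weight each update, so the exploration pressure fades as the policy matures rather than staying fixed.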