🤖 AI Summary
During reinforcement learning in sparse-reward, multi-turn environments, LLM-based agents suffer from an "exploration-exploitation cascade failure": early policies prematurely converge to low-entropy, defective strategies, and later entropy regularization paradoxically triggers policy collapse.
Method: The paper systematically identifies this cascade-failure mechanism for the first time and proposes an entropy-regularized policy optimization framework to break it. The approach introduces (i) an entropy-smoothing regularizer, which constrains current policy entropy toward its historical mean, and (ii) an adaptive, phase-wise weighting scheme; together these ensure monotonically decreasing entropy variance and stable policy convergence.
Contribution/Results: The framework combines theoretical rigor with engineering practicality. Evaluated on the ScienceWorld and ALFWorld benchmarks, it achieves performance improvements of up to 152% and up to 19.8%, respectively, significantly surpassing existing state-of-the-art methods.
📝 Abstract
Training LLM agents in multi-turn environments with sparse rewards, where completing a single task requires 30+ turns of interaction within an episode, presents a fundamental challenge for reinforcement learning. We identify a critical failure mode unique to this setting: the exploration-exploitation cascade failure. This cascade begins with early-stage policy premature convergence, where sparse feedback causes agents to commit to flawed, low-entropy strategies. Subsequently, agents enter late-stage policy collapse, where conventional entropy regularization becomes counterproductive, promoting chaotic exploration that destabilizes training. We propose Entropy-regularized Policy Optimization (EPO), a general framework that breaks this failure cycle through three synergistic mechanisms: (1) adopting entropy regularization in multi-turn settings to enhance exploration, (2) an entropy smoothing regularizer that bounds policy entropy within historical averages to prevent abrupt fluctuations, and (3) adaptive phase-based weighting that balances exploration and exploitation across training. Our theoretical analysis shows that EPO guarantees monotonically decreasing entropy variance while maintaining convergence. EPO achieves up to 152% performance improvement on ScienceWorld and up to 19.8% on ALFWorld. Our work demonstrates that multi-turn sparse-reward settings require fundamentally different entropy control than traditional RL, with broad implications for LLM agent training.
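To make the three mechanisms concrete, here is a minimal NumPy sketch of how an EPO-style loss could be assembled: an entropy bonus (mechanism 1), a smoothing penalty tying current entropy to its historical mean (mechanism 2), and a phase-dependent weight (mechanism 3). All function names, the linear weight schedule, and the quadratic penalty form are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def policy_entropy(probs, eps=1e-12):
    """Mean Shannon entropy of a batch of action distributions."""
    probs = np.clip(probs, eps, 1.0)
    return float(-(probs * np.log(probs)).sum(axis=-1).mean())

def smoothing_penalty(current_entropy, entropy_history):
    """Quadratic penalty on deviation of current entropy from its
    running historical mean (assumed form of the smoothing regularizer)."""
    if not entropy_history:
        return 0.0
    return (current_entropy - float(np.mean(entropy_history))) ** 2

def phase_weight(step, total_steps, w_early=1e-2, w_late=1e-3):
    """Adaptive phase-based weight: stronger entropy pressure early in
    training, weaker late (illustrative linear schedule)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return (1.0 - frac) * w_early + frac * w_late

def epo_loss(pg_loss, probs, entropy_history, step, total_steps, alpha=0.5):
    """Combine a policy-gradient loss with the weighted entropy bonus
    and smoothing penalty; records entropy for future smoothing."""
    h = policy_entropy(probs)
    w = phase_weight(step, total_steps)
    loss = pg_loss - w * h + alpha * w * smoothing_penalty(h, entropy_history)
    entropy_history.append(h)
    return loss
```

In this sketch, subtracting `w * h` rewards exploration while the `smoothing_penalty` term discourages abrupt entropy swings, which is the intuition behind the claimed monotonic reduction of entropy variance; a faithful implementation would apply these terms per token inside the RL objective of the agent-training loop.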