Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents

📅 2025-09-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Long-horizon LLM agents struggle with credit assignment under sparse, outcome-based rewards. This paper identifies a fundamental coupling between the magnitude of LLM policy gradients and action entropy, which destabilizes training: confident correct actions receive inefficiently small updates while uncertain actions receive large, destabilizing ones. The proposed Entropy-Modulated Policy Gradients (EMPG) framework re-calibrates the learning signal at each step using step-level uncertainty and the final task outcome, amplifying updates for confident correct actions, penalizing confident errors, and attenuating updates from uncertain steps. A “future clarity” bonus additionally encourages agents to find more predictable solution paths. Empirical evaluation on WebShop, ALFWorld, and Deep Search shows substantial gains over strong policy gradient baselines, along with improved training stability.

📝 Abstract
In long-horizon tasks, recent agents based on Large Language Models (LLMs) face a significant challenge: sparse, outcome-based rewards make it difficult to assign credit to intermediate steps. Previous methods mainly focus on creating dense reward signals to guide learning, either through traditional reinforcement learning techniques such as inverse reinforcement learning or by using Process Reward Models for step-by-step feedback. In this paper, we identify a fundamental problem in the learning dynamics of LLMs: the magnitude of the policy gradient is inherently coupled with the policy's entropy, which leads to inefficiently small updates for confident correct actions and potentially destabilizing large updates for uncertain ones. To resolve this, we propose Entropy-Modulated Policy Gradients (EMPG), a framework that re-calibrates the learning signal based on step-wise uncertainty and the final task outcome. EMPG amplifies updates for confident correct actions, penalizes confident errors, and attenuates updates from uncertain steps to stabilize exploration. We further introduce a bonus term for future clarity that encourages agents to find more predictable solution paths. Through comprehensive experiments on three challenging agent tasks, WebShop, ALFWorld, and Deep Search, we demonstrate that EMPG achieves substantial performance gains and significantly outperforms strong policy gradient baselines. The project page is at https://empgseed-seed.github.io/
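To make the identified coupling concrete, here is a minimal, self-contained PyTorch sketch (not from the paper; the logits are made up) showing that for a softmax policy the gradient of -log π(a|s) is large exactly when the step is high-entropy and small when the policy is confident:

```python
import torch
import torch.nn.functional as F

# Toy illustration of the coupling: for a softmax policy, the gradient of
# -log pi(a|s) w.r.t. the logits is (pi - onehot_a), so confident
# (low-entropy) steps produce small updates and uncertain (high-entropy)
# steps produce large ones, regardless of whether the step was good.
for logits in (torch.tensor([4.0, 0.0, 0.0]),   # confident step
               torch.tensor([0.2, 0.1, 0.0])):  # uncertain step
    logits = logits.clone().requires_grad_(True)
    loss = -F.log_softmax(logits, dim=-1)[0]    # pretend action 0 was taken
    loss.backward()
    p = F.softmax(logits.detach(), dim=-1)
    entropy = -(p * p.log()).sum()
    print(f"entropy={entropy:.3f}  grad_norm={logits.grad.norm():.3f}")
```

Running this prints a much larger gradient norm for the uncertain step; EMPG breaks this coupling by re-weighting each step's update with its confidence and the task outcome.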
Problem

Research questions and friction points this paper is trying to address.

Addresses sparse rewards in long-horizon LLM agent tasks
Resolves the inherent coupling between policy-gradient magnitude and action entropy
Stabilizes learning dynamics for confident and uncertain actions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modulates policy gradients using entropy-based step-level uncertainty
Amplifies confident correct actions and penalizes confident errors
Introduces a “future clarity” bonus that rewards more predictable solution paths (see the sketch after this list)
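The bullets above can be sketched in a few lines of PyTorch. This is a hedged reconstruction from the abstract, not the paper's exact formulation: the exponential entropy-to-confidence mapping, the per-trajectory normalization, and the hyperparameters `alpha` and `beta` are all assumptions.

```python
import torch

def empg_weights(entropies: torch.Tensor, success: bool, alpha: float = 1.0) -> torch.Tensor:
    """Entropy-modulated step weights (sketch of the EMPG idea).

    Low-entropy (confident) steps get |weight| near 1; high-entropy
    (uncertain) steps are attenuated toward 0. The sign comes from the
    final task outcome: amplify confident correct actions, penalize
    confident errors. The exp(-alpha * H) shaping is an assumption.
    """
    h = entropies / (entropies.mean() + 1e-8)   # per-trajectory normalization (assumed)
    confidence = torch.exp(-alpha * h)
    return confidence if success else -confidence

def future_clarity_bonus(entropies: torch.Tensor, beta: float = 0.1) -> torch.Tensor:
    """Hypothetical 'future clarity' bonus: reward step t when the policy's
    entropy at step t+1 is low, i.e. the chosen action made the future more
    predictable. The functional form and beta are assumptions."""
    bonus = beta * torch.exp(-entropies[1:])    # low next-step entropy -> larger bonus
    return torch.cat([bonus, torch.zeros(1)])   # final step has no successor

def empg_loss(logp_taken: torch.Tensor, entropies: torch.Tensor, success: bool) -> torch.Tensor:
    """REINFORCE-style surrogate loss with EMPG weighting plus the clarity bonus."""
    w = empg_weights(entropies.detach(), success) + future_clarity_bonus(entropies.detach())
    return -(w * logp_taken).mean()
```

For example, given per-step log-probabilities and entropies collected during a rollout, `empg_loss(logp, H, success=True)` yields larger updates on the confident steps of a winning trajectory while damping the noisy, high-entropy ones.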
👥 Authors
Jiawei Wang (ByteDance)
Jiacai Liu (Fudan University)
Yuqian Fu (ByteDance)
Yingru Li (ByteDance)
Xintao Wang (ByteDance)
Yuan Lin (Ocean College, Zhejiang University)
Yu Yue (ByteDance)
Lin Zhang (ByteDance)
Yang Wang (ByteDance)
Ke Wang (ByteDance)