Agentic Entropy-Balanced Policy Optimization

📅 2025-10-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address training collapse in web agents caused by overreliance on entropy signals during long-horizon tool invocation, this paper proposes an Entropy-Balanced Reinforcement Learning framework. Methodologically, it integrates dynamic rollouts, multi-step tool modeling, and stable policy updates. Key contributions include: (1) entropy pre-monitoring and branching penalty, enabling adaptive allocation of global–local sampling budgets; and (2) stop-gradient high-entropy clipping coupled with entropy-aware advantage estimation, ensuring gradient flow and prioritized learning for high-uncertainty tokens. Evaluated on 14 challenging benchmarks—including GAIA, Humanity’s Last Exam, and WebWalker—the method consistently outperforms seven state-of-the-art RL algorithms. With only 1K training samples, it achieves a peak Pass@5 score of 70.0%, significantly improving both multi-step decision stability and sample efficiency.


📝 Abstract
Recently, Agentic Reinforcement Learning (Agentic RL) has made significant progress in incentivizing the multi-turn, long-horizon tool-use capabilities of web agents. While mainstream agentic RL algorithms autonomously explore high-uncertainty tool-call steps under the guidance of entropy, excessive reliance on entropy signals can impose further constraints, leading to training collapse. In this paper, we delve into the challenges caused by entropy and propose Agentic Entropy-Balanced Policy Optimization (AEPO), an agentic RL algorithm designed to balance entropy in both the rollout and policy-update phases. AEPO comprises two core components: (1) a dynamic entropy-balanced rollout mechanism that adaptively allocates the global and branch sampling budgets through entropy pre-monitoring, while imposing a branch penalty on consecutive high-entropy tool-call steps to prevent over-branching; and (2) entropy-balanced policy optimization, which inserts a stop-gradient operation into the high-entropy clipping term to preserve and properly rescale gradients on high-entropy tokens, while incorporating entropy-aware advantage estimation to prioritize learning on high-uncertainty tokens. Results across 14 challenging datasets show that AEPO consistently outperforms 7 mainstream RL algorithms. With just 1K RL samples, Qwen3-14B with AEPO achieves impressive results: 47.6% on GAIA, 11.2% on Humanity's Last Exam, and 43.0% on WebWalker for Pass@1; and 65.0% on GAIA, 26.0% on Humanity's Last Exam, and 70.0% on WebWalker for Pass@5. Further analysis reveals that AEPO improves rollout sampling diversity while maintaining stable policy entropy, facilitating scalable web-agent training.
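The dynamic entropy-balanced rollout described in the abstract — pre-monitor entropy to split the sampling budget, then penalize consecutive high-entropy branching — can be sketched as follows. This is a minimal illustration under stated assumptions: the linear budget-allocation rule, the `tau` threshold, and the multiplicative `penalty` are placeholders, not the paper's exact formulas.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def allocate_budget(probe_entropies, total_budget, max_entropy):
    """Entropy pre-monitoring: split a fixed rollout budget between fresh
    global trajectories and local branches. Higher pre-monitored entropy
    shifts budget toward branching (the linear rule is an assumption)."""
    mean_h = sum(probe_entropies) / len(probe_entropies)
    branch_frac = min(1.0, mean_h / max_entropy)
    branch_budget = round(total_budget * branch_frac)
    return total_budget - branch_budget, branch_budget  # (global, branch)

def branch_probabilities(step_entropies, tau, penalty=0.5):
    """Branch at high-entropy tool-call steps, damping consecutive
    high-entropy branches with a multiplicative penalty to avoid
    over-branching (the exact penalty form is an assumption)."""
    probs, streak = [], 0
    for h in step_entropies:
        if h > tau:
            probs.append(penalty ** streak)  # 1.0, 0.5, 0.25, ...
            streak += 1
        else:
            probs.append(0.0)
            streak = 0
    return probs
```

For example, `branch_probabilities([0.9, 0.9, 0.1, 0.9], tau=0.5)` yields `[1.0, 0.5, 0.0, 1.0]`: the second consecutive high-entropy step branches with halved probability, and the streak resets after a low-entropy step.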
Problem

Research questions and friction points this paper is trying to address.

Excessive entropy reliance causes training collapse in agentic RL
Unbalanced entropy signals constrain multi-turn tool-use capabilities
High-uncertainty tool-call steps lead to over-branching issues
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic entropy-balanced rollout mechanism for adaptive sampling
Stop-gradient operation in high-entropy clipping for gradient preservation
Entropy-aware advantage estimation to prioritize uncertain tokens
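The two policy-update ideas above can be illustrated at the level of a single token's surrogate objective. This is a value-only sketch: the `1 + alpha * entropy` rescaling and the threshold `tau` are assumed forms, and in a real autodiff framework the high-entropy clip factor would be computed under a stop-gradient (e.g. `(clipped / ratio).detach() * ratio` in PyTorch) so the clipped value is kept while gradients still flow — a distinction invisible in plain value arithmetic.

```python
def clip(x, lo, hi):
    """Clamp x into [lo, hi]."""
    return max(lo, min(x, hi))

def aepo_token_surrogate(ratio, advantage, entropy, tau, eps=0.2, alpha=0.2):
    """PPO-style clipped surrogate for one token, with an AEPO-flavoured
    entropy-aware advantage rescaling for high-uncertainty tokens.

    NOTE: values only. In AEPO the clip term for high-entropy tokens is
    wrapped in a stop-gradient so the gradient is preserved and rescaled
    rather than zeroed; that only matters under automatic differentiation.
    """
    # Entropy-aware advantage: up-weight high-uncertainty tokens
    # (functional form is an assumption, not the paper's exact estimator).
    if entropy > tau:
        advantage = advantage * (1.0 + alpha * entropy)
    clipped_ratio = clip(ratio, 1.0 - eps, 1.0 + eps)
    # Pessimistic min, as in PPO's clipped objective (to be maximized).
    return min(ratio * advantage, clipped_ratio * advantage)
```

With `ratio=1.5`, `advantage=1.0`, and low entropy this returns the clipped value `1.2`; raising the token's entropy above `tau` rescales the advantage first, so uncertain tokens contribute a larger learning signal.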