APEX: Autonomous Policy Exploration for Self-Evolving LLM Agents

📅 2026-05-20

🤖 AI Summary
This work addresses sustained test-time learning in self-evolving large language model agents, which often suffer from exploration collapse: as memory accumulates, the agent over-relies on known high-reward behaviors. The authors propose a continual exploration framework that requires no weight updates. It constructs a Strategy Map, with milestones as nodes and dependency relations as edges, and combines an evidence-based Fork Discovery mechanism with a Policy Selection step that balances exploration and exploitation. This lets the agent expand systematically into unexplored directions and adapt its decisions as evidence accumulates. Evaluated on nine Jericho text-based adventure games and the WebArena web interaction benchmark, the approach outperforms all baselines. Ablation studies confirm the contribution of each component, and the framework remains robust across diverse settings.
📝 Abstract
LLM agents have shown strong performance across a wide range of complex tasks, including interactive environments that require long-horizon decision making. But these agents cannot learn on the fly at test time. Self-evolving agents address this by accumulating memory and reflection across episodes rather than requiring model-weight updates. However, these agents often suffer from exploration collapse: as memory grows, behavior concentrates around familiar high-reward routines, reducing the chance of discovering better alternatives. To address this problem, we propose Autonomous Policy EXploration (APEX), which builds and maintains an explicit strategy space through a strategy map: a directed acyclic graph of milestones with prerequisite dependency edges. In APEX, Fork Discovery expands the map with evidence-grounded unexplored directions, while Policy Selection balances exploration and exploitation during planning. Evaluated on nine Jericho text-adventure games and WebArena, a realistic web interaction benchmark, APEX outperforms all baselines. Extensive ablations validate each component's contribution and show robustness across diverse settings, demonstrating APEX's effectiveness for sustained exploration in self-evolving agents.
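To make the strategy-map idea concrete, here is a minimal sketch of a directed acyclic graph of milestones with prerequisite edges, plus a selection step that trades off exploration against exploitation. It is an illustration under stated assumptions, not the authors' implementation: the class names, fields, and methods are hypothetical, and the selection rule is a standard UCB1 score swapped in as a stand-in, since the abstract does not say how Policy Selection scores candidates. Adding a node with `add_milestone` loosely plays the role of Fork Discovery recording a new direction; in the paper those directions are grounded in evidence, which this sketch does not model.

```python
import math
from dataclasses import dataclass


@dataclass
class Milestone:
    """One node of the strategy map (hypothetical schema, not from the paper)."""
    name: str
    prerequisites: frozenset = frozenset()
    visits: int = 0            # episodes that attempted this milestone
    total_reward: float = 0.0  # summed reward over those attempts


class StrategyMap:
    """Milestones as nodes, prerequisite dependencies as edges."""

    def __init__(self):
        self.nodes = {}        # name -> Milestone, kept in insertion order
        self.achieved = set()  # names of milestones reached at least once

    def add_milestone(self, name, prerequisites=()):
        """Add a node; prerequisites must already exist, so no cycle can form."""
        prereqs = frozenset(prerequisites)
        if name in self.nodes:
            raise ValueError(f"duplicate milestone: {name}")
        missing = sorted(p for p in prereqs if p not in self.nodes)
        if missing:
            raise KeyError(f"unknown prerequisites: {missing}")
        self.nodes[name] = Milestone(name, prereqs)

    def record(self, name, reward, achieved=False):
        """Update statistics after an episode that pursued `name`."""
        node = self.nodes[name]
        node.visits += 1
        node.total_reward += reward
        if achieved:
            self.achieved.add(name)

    def candidates(self):
        """Milestones whose prerequisites have all been achieved."""
        return [m for m in self.nodes.values()
                if m.prerequisites <= self.achieved]

    def select(self, c=1.4):
        """Pick a candidate by UCB1 (an assumed rule, not the paper's)."""
        cands = self.candidates()
        if not cands:
            raise ValueError("no milestone has its prerequisites satisfied")
        total = sum(m.visits for m in cands)

        def score(m):
            if m.visits == 0:
                return math.inf  # untried directions are explored first
            mean = m.total_reward / m.visits
            return mean + c * math.sqrt(math.log(total) / m.visits)

        return max(cands, key=score).name
```

For example, after adding a milestone `"get lamp"` and a dependent milestone `"open trapdoor"` (both made-up names), `select()` can only return `"get lamp"` until a `record(..., achieved=True)` call unlocks the dependent node. Requiring prerequisites to exist before a node is added keeps the graph acyclic by construction, which avoids a separate cycle check.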
Problem

Research questions and friction points this paper is trying to address.

exploration collapse
self-evolving agents
LLM agents
long-horizon decision making
strategy exploration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Autonomous Policy Exploration
Strategy Map
Exploration-Exploitation Balance
Self-Evolving Agents
Directed Acyclic Graph