Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization

📅 2026-05-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limited adaptive exploration of current large language model agents, which struggle to determine at test time when exploration is needed. The authors propose an exploration-aware reinforcement learning framework that uses variational inference to construct a fine-grained reward function evaluating the potential future value of exploratory actions. An exploration-aware grouping mechanism separates exploratory actions from task-execution actions during policy optimization, so that the agent learns to explore selectively under high uncertainty. Evaluated across text-based and GUI agent benchmarks, the method is reported to improve performance consistently, with gains in both exploration efficiency and overall task success.
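The idea of exploring only under high uncertainty can be illustrated with a small sketch. This is not the authors' method: it assumes, purely for illustration, that uncertainty is measured as the Shannon entropy of the policy's action distribution and compared against a fixed threshold. The names `entropy`, `should_explore`, and `threshold` are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def should_explore(action_probs, threshold=0.5):
    """Hypothetical gate: explore only when the policy is uncertain.

    Assumption: uncertainty is the entropy of the action distribution,
    and `threshold` is an illustrative hyperparameter, not a value
    taken from the paper.
    """
    return entropy(action_probs) > threshold


# A near-uniform distribution signals uncertainty; a peaked one does not.
uncertain = should_explore([0.5, 0.5])
confident = should_explore([0.99, 0.01])
```

In this sketch the agent would gather more environmental feedback in the first case and move straight to execution in the second. The paper itself describes a learned behavior, so a fixed threshold is only a stand-in for it.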

📝 Abstract

Recent advancements in agentic test-time scaling allow models to gather environmental feedback before committing to final actions. A key limitation of existing methods is that they typically employ undifferentiated exploration strategies, lacking the ability to adaptively distinguish when exploration is truly required. In this paper, we propose an exploration-aware reinforcement learning framework that enables LLM agents to adaptively explore only when uncertainty is high. Our method introduces a fine-grained reward function via variational inference that explicitly evaluates exploratory actions by estimating their potential to improve future decision-making, together with an exploration-aware grouping mechanism that separates exploratory actions from task-completion actions during optimization. By targeting informational gaps, this design allows agents to explore selectively and transition to execution as soon as the task context is clear. Empirically, we demonstrate that our approach achieves consistent improvements across a range of challenging text-based and GUI-based agent benchmarks. Code is available at https://github.com/HansenHua/EAPO-ICML26 and models are available at https://huggingface.co/hansenhua/EAPO-ICML26.
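One way to picture separating exploratory actions from task-completion actions during optimization is to normalize rewards within each action type, so that the two kinds of action are not scored against the same baseline. The sketch below assumes a group-relative advantage (mean and standard deviation computed per group). The abstract does not specify the estimator, so the function `grouped_advantages` and its arguments are hypothetical and should not be read as the released implementation.

```python
from statistics import mean, pstdev


def grouped_advantages(rewards, is_exploratory, eps=1e-8):
    """Normalize rewards separately for exploratory and task actions.

    Assumption: each action carries a scalar reward and a boolean flag
    marking it as exploratory. Advantages are standardized within each
    group, which is an illustrative choice rather than the paper's
    stated estimator.
    """
    advantages = [0.0] * len(rewards)
    for flag in (True, False):
        idx = [i for i, e in enumerate(is_exploratory) if e == flag]
        if not idx:
            continue  # no actions of this type in the batch
        group = [rewards[i] for i in idx]
        mu, sigma = mean(group), pstdev(group)
        for i in idx:
            advantages[i] = (rewards[i] - mu) / (sigma + eps)
    return advantages


# Two exploratory actions (small rewards) and two task actions (large rewards).
adv = grouped_advantages(
    rewards=[1.0, 3.0, 10.0, 20.0],
    is_exploratory=[True, True, False, False],
)
```

With a single shared baseline, both exploratory actions here would receive negative advantages simply because their rewards are on a smaller scale. Per-group normalization compares each action only against others of the same type.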

Problem

Research questions and friction points this paper is trying to address.

agentic reasoning

exploration strategy

uncertainty

reinforcement learning

LLM agents

Innovation

Methods, ideas, or system contributions that make the work stand out.

exploration-aware

adaptive exploration

variational inference