Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization

📅 2026-02-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited exploration capability of large language model (LLM) agents trained with reinforcement learning, which often hinders their discovery of novel states. To overcome this challenge, the authors propose EMPO², a framework that integrates memory-augmented exploration with hybrid on- and off-policy optimization. By using memory to guide exploration while jointly optimizing the policy, EMPO² improves both exploration efficiency and generalization across in-distribution and out-of-distribution tasks. Experimental results show that EMPO² outperforms GRPO by 128.6% on ScienceWorld and by 11.3% on WebShop. Notably, in out-of-distribution scenarios, the agent adapts within a few memory-informed trials, without requiring any parameter updates.

📝 Abstract
Exploration remains the key bottleneck for large language model agents trained with reinforcement learning. While prior methods exploit pretrained knowledge, they fail in environments requiring the discovery of novel states. We propose Exploratory Memory-Augmented On- and Off-Policy Optimization (EMPO$^2$), a hybrid RL framework that leverages memory for exploration and combines on- and off-policy updates to make LLMs perform well with memory while also ensuring robustness without it. On ScienceWorld and WebShop, EMPO$^2$ achieves 128.6% and 11.3% improvements over GRPO, respectively. Moreover, in out-of-distribution tests, EMPO$^2$ demonstrates superior adaptability to new tasks, requiring only a few trials with memory and no parameter updates. These results highlight EMPO$^2$ as a promising framework for building more exploratory and generalizable LLM-based agents.
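The abstract describes two ingredients: an episodic memory that guides exploration, and a hybrid objective that mixes on-policy updates (rollouts conditioned on memory) with off-policy updates (so the agent stays robust without memory). The paper does not publish its implementation here, so the sketch below is purely illustrative: `EpisodicMemory`, `retrieve`, and `hybrid_loss` are assumed names, and the weighted-sum loss is one plausible way to combine the two objectives, not necessarily the paper's.

```python
# Illustrative sketch only -- all names and the weighting scheme are
# assumptions, not EMPO^2's actual API.

class EpisodicMemory:
    """Keeps the highest-reward trajectories for retrieval-guided exploration."""

    def __init__(self, capacity=100):
        self.capacity = capacity
        self.trajectories = []  # list of (reward, trajectory) pairs

    def add(self, trajectory, reward):
        # Insert, then keep only the top-`capacity` trajectories by reward.
        self.trajectories.append((reward, trajectory))
        self.trajectories.sort(key=lambda pair: pair[0], reverse=True)
        del self.trajectories[self.capacity:]

    def retrieve(self, k=3):
        # Return up to k best trajectories to place in the agent's context,
        # steering exploration toward previously successful behavior.
        return [traj for _, traj in self.trajectories[:k]]


def hybrid_loss(on_policy_loss, off_policy_loss, alpha=0.5):
    # One simple way to combine the two objectives: a convex mix of an
    # on-policy loss (rollouts with memory in context) and an off-policy
    # loss (replayed trajectories without memory), so the policy performs
    # well with memory yet remains robust when memory is absent.
    return alpha * on_policy_loss + (1.0 - alpha) * off_policy_loss
```

At out-of-distribution test time this design would need no gradient steps: the agent simply runs a few trials, stores the successful ones via `add`, and conditions later attempts on `retrieve`, matching the adaptation-without-parameter-updates behavior the abstract reports.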
Problem

Research questions and friction points this paper is trying to address.

exploration
large language model agents
reinforcement learning
novel states
generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

memory-augmented
hybrid on- and off-policy
exploration
large language model agents
out-of-distribution generalization