🤖 AI Summary
Large language model (LLM)-based agents face significant challenges in GUI interaction, namely long-horizon decision-making, sparse rewards, and high trial-and-error costs, which hinder effective policy optimization. Method: We propose Agentic Replay Policy Optimization (ARPO), a reinforcement learning framework that integrates experience replay with multimodal state encoding (GUI screenshots plus textual instructions), adds a baseline-driven task filtering mechanism to stabilize training, and builds on Group Relative Policy Optimization (GRPO) to enable end-to-end vision-language joint policy learning. Contribution/Results: Compared with offline preference learning, ARPO substantially improves policy generalization and task success rates. On the OSWorld benchmark, it achieves state-of-the-art RL performance for GUI automation and establishes the first dedicated RL training baseline for LLM-based GUI agents.
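The baseline-driven task filtering idea can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: the function name `filter_tasks` and the threshold convention are assumptions. The intuition is that tasks a baseline agent always solves (or never solves) yield no group-relative learning signal, so training focuses on tasks with intermediate success rates.

```python
# Hypothetical sketch of baseline-driven task filtering (names are
# assumptions, not the paper's code). Tasks where the baseline agent
# always succeeds or always fails carry no group-relative reward
# variance, so they are dropped from the training pool.

def filter_tasks(tasks, baseline_success_rate, low=0.0, high=1.0):
    """Keep tasks whose baseline success rate is strictly between low and high."""
    return [t for t in tasks if low < baseline_success_rate[t] < high]

tasks = ["open_browser", "rename_file", "impossible_task", "trivial_task"]
rates = {"open_browser": 0.4, "rename_file": 0.7,
         "impossible_task": 0.0, "trivial_task": 1.0}
selected = filter_tasks(tasks, rates)  # → ["open_browser", "rename_file"]
```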
📝 Abstract
Training large language models (LLMs) as interactive agents for controlling graphical user interfaces (GUIs) presents a unique challenge: optimizing long-horizon action sequences with multimodal feedback from complex environments. While recent works have advanced multi-turn reinforcement learning (RL) for reasoning and tool-use capabilities in LLMs, their application to GUI-based agents remains relatively underexplored due to sparse rewards, delayed feedback, and high rollout costs. In this paper, we investigate end-to-end policy optimization for vision-language-based GUI agents with the aim of improving performance on complex, long-horizon computer tasks. We propose Agentic Replay Policy Optimization (ARPO), an end-to-end RL approach that augments Group Relative Policy Optimization (GRPO) with a replay buffer to reuse successful experience across training iterations. To further stabilize training, we propose a task selection strategy that filters tasks based on baseline agent performance, allowing the agent to focus on learning from informative interactions. Additionally, we compare ARPO with offline preference optimization approaches, highlighting the advantages of policy-based methods in GUI environments. Experiments on the OSWorld benchmark demonstrate that ARPO achieves competitive results, establishing a new performance baseline for LLM-based GUI agents trained via reinforcement learning. Our findings underscore the effectiveness of reinforcement learning for training multi-turn, vision-language GUI agents capable of managing complex real-world UI interactions. Code and models: https://github.com/dvlab-research/ARPO.git.
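The replay-augmented GRPO idea described above can be sketched as follows. This is a minimal illustrative sketch under stated assumptions, not the paper's actual implementation: the class and function names, the success-only storage rule, and the buffer capacity are all hypothetical. It shows why mixing stored successful rollouts into a failing group keeps the group-relative advantage signal from collapsing under sparse rewards.

```python
# Illustrative sketch only (all names are assumptions): a per-task replay
# buffer of successful rollouts, plus the group-relative advantage
# normalization used by GRPO-style methods.
from collections import deque

class ReplayBuffer:
    """Stores recent successful trajectories per task for reuse."""
    def __init__(self, capacity=4):
        self.buffer = {}  # task_id -> deque of successful rollouts
        self.capacity = capacity

    def add(self, task_id, rollout):
        # Sparse-reward setting: only keep rollouts that succeeded.
        if rollout["reward"] > 0:
            self.buffer.setdefault(
                task_id, deque(maxlen=self.capacity)).append(rollout)

    def sample(self, task_id):
        return list(self.buffer.get(task_id, []))

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: z-score rewards within the rollout group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# If every fresh rollout for a task fails, the group reward variance is
# zero and the advantages degenerate. Mixing in a stored success restores
# a nonzero learning signal.
fresh = [{"reward": 0.0}, {"reward": 0.0}, {"reward": 0.0}]
replay = ReplayBuffer()
replay.add("task-1", {"reward": 1.0})
group = fresh + replay.sample("task-1")
advs = grpo_advantages([r["reward"] for r in group])
```

With only the three failed rollouts, all rewards are identical and every advantage is zero; after adding the replayed success, the failed rollouts get negative advantages and the success a positive one.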