Efficient Multi-turn RL for GUI Agents via Decoupled Training and Adaptive Data Curation

📅 2025-09-28
🤖 AI Summary
To address the low reinforcement learning efficiency and scarcity of high-quality trajectory data for vision-language model (VLM)-driven GUI agents in multi-turn interactions, this paper proposes DART, a decoupled asynchronous training framework. DART fully decouples four core components—environment interaction, rollout generation, data management, and policy training—enabling non-blocking communication and fine-grained, rollout-level asynchronous coordination. It further introduces an adaptive data filtering mechanism integrating trajectory-level sampling, dynamic rollout length adjustment, entropy-weighted prioritization of high-uncertainty decisions, and truncated importance reweighting. Evaluated on the OSWorld benchmark, DART-GUI-7B achieves a 42.13% task success rate—improving over its base model by 14.61 percentage points and surpassing the open-source state of the art by 7.34 percentage points. The code, datasets, and models are fully open-sourced.
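The decoupled, non-blocking coordination the summary describes can be pictured as a producer/consumer pipeline. The sketch below is illustrative only (not the authors' implementation): rollout workers push finished trajectories into a shared buffer while the trainer consumes them asynchronously, so a slow environment never stalls the optimizer.

```python
# Minimal sketch of rollout-level asynchronous coordination using a
# bounded queue: rollout workers produce, the trainer consumes.
import queue
import threading

rollout_queue = queue.Queue(maxsize=8)  # buffers completed trajectories

def rollout_worker(worker_id, n_rollouts):
    # Stand-in for VLM policy rollout against a GUI environment.
    for step in range(n_rollouts):
        trajectory = {"worker": worker_id, "steps": [f"action_{step}"]}
        rollout_queue.put(trajectory)  # blocks only if the buffer is full

def trainer(n_expected, batch):
    # Consumes trajectories as they arrive, without waiting for every
    # environment worker to finish its batch.
    for _ in range(n_expected):
        batch.append(rollout_queue.get())
        rollout_queue.task_done()

collected = []
workers = [threading.Thread(target=rollout_worker, args=(i, 4)) for i in range(3)]
consumer = threading.Thread(target=trainer, args=(12, collected))
for t in workers:
    t.start()
consumer.start()
for t in workers:
    t.join()
consumer.join()
print(len(collected))  # 12 trajectories gathered without a global barrier
```

In the real system the buffer would sit between an environment cluster and a distributed trainer, with per-worker model synchronization; the queue here only conveys the non-blocking hand-off.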

📝 Abstract
Vision-language model (VLM) based GUI agents show promise for automating complex desktop and mobile tasks, but face significant challenges in applying reinforcement learning (RL): (1) slow multi-turn interactions with GUI environments for policy rollout, and (2) insufficient high-quality agent-environment interactions for policy learning. To address these challenges, we propose DART, a Decoupled Agentic RL Training framework for GUI agents, which coordinates heterogeneous modules in a highly decoupled manner. DART separates the training system into four asynchronous modules: environment cluster, rollout service, data manager, and trainer. This design enables non-blocking communication, asynchronous training, rollout-wise trajectory sampling, and per-worker model synchronization, significantly improving system efficiency: 1.6× GPU utilization for rollout, 1.9× training throughput, and 5.5× environment utilization. To facilitate effective learning from abundant samples, we introduce an adaptive data curation scheme: (1) pre-collecting successful trajectories for challenging tasks to supplement sparse success in online sampling; (2) dynamically adjusting rollout numbers and trajectory lengths based on task difficulty; (3) training selectively on high-entropy steps to prioritize critical decisions; (4) stabilizing learning via truncated importance sampling to correct the policy mismatch between rollout and update. On the OSWorld benchmark, DART-GUI-7B achieves a 42.13% task success rate, a 14.61% absolute gain over the base model, and 7.34% higher than open-source SOTA. We will fully open-source our training framework, data, and model checkpoints via computer-use-agents.github.io/dart-gui, which we believe is a timely contribution to the open-source community of agentic RL training.
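Item (4) of the curation scheme corrects for the fact that trajectories are generated by a stale policy while gradients are computed under the updated one. A hedged sketch of truncated importance reweighting follows; the clip threshold `c` and the surrogate-loss form are illustrative assumptions, not the paper's exact estimator.

```python
# Truncated importance sampling: the likelihood ratio between the
# updated policy and the (stale) rollout policy is clipped at c to
# bound the variance of off-policy gradient estimates.
import math

def truncated_is_weight(logp_new, logp_old, c=2.0):
    # Importance ratio pi_new / pi_old, truncated at c.
    ratio = math.exp(logp_new - logp_old)
    return min(ratio, c)

def weighted_loss(advantages, logps_new, logps_old, c=2.0):
    # Policy-gradient-style surrogate: each step's advantage is scaled
    # by its truncated importance weight, then averaged.
    weights = [truncated_is_weight(n, o, c) for n, o in zip(logps_new, logps_old)]
    return -sum(w * a for w, a in zip(weights, advantages)) / len(advantages)

# A ratio of 5.0 would be clipped down to c = 2.0:
print(truncated_is_weight(math.log(5.0), 0.0))  # 2.0
```

Truncation (a hard min) differs from PPO-style clipping in that it bounds the weight itself rather than the surrogate objective; either way, the goal is stable updates despite rollout/update mismatch.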
Problem

Research questions and friction points this paper is trying to address.

Addressing slow multi-turn GUI interactions for RL training
Solving insufficient high-quality agent-environment interaction data
Improving reinforcement learning efficiency for vision-language GUI agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoupled framework enables asynchronous multi-module training
Adaptive data curation optimizes trajectory sampling and learning
Truncated importance sampling stabilizes policy mismatch correction
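The entropy-based step prioritization mentioned above can be sketched as follows. This is an assumed top-k formulation for illustration; the paper describes entropy-weighted prioritization of high-uncertainty decisions without committing to this exact rule.

```python
# Select the most uncertain decision steps in a trajectory by ranking
# per-step policy entropy and keeping the top-k for the policy update.
import math

def step_entropy(probs):
    # Shannon entropy (in nats) of one step's action distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_high_entropy_steps(step_dists, k):
    # Rank steps by entropy, return the indices of the k most uncertain
    # decisions -- the "critical" steps worth spending gradient on.
    ents = [(step_entropy(d), i) for i, d in enumerate(step_dists)]
    ents.sort(reverse=True)
    return sorted(i for _, i in ents[:k])

dists = [
    [0.97, 0.01, 0.01, 0.01],  # confident step, low entropy
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain step
    [0.60, 0.20, 0.10, 0.10],  # moderately uncertain step
]
print(select_high_entropy_steps(dists, 2))  # [1, 2]
```

The intuition: steps where the policy is nearly deterministic contribute little learning signal, so filtering to high-entropy steps concentrates updates on genuinely ambiguous GUI decisions.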
Pengxiang Li
Beijing Institute of Technology
Multimodal Agent, Vision and Language, 3DV, Hyperbolic Learning
Zechen Hu
George Mason University
Reinforcement learning
Zirui Shang
Beijing Institute of Technology, State Key Laboratory of General Artificial Intelligence, BIGAI
Jingrong Wu
DataCanvas
Yang Liu
State Key Laboratory of General Artificial Intelligence, BIGAI
Hui Liu
DataCanvas
Zhi Gao
Beijing Institute of Technology, State Key Laboratory of General Artificial Intelligence, BIGAI
Chenrui Shi
Beijing Institute of Technology
Anomaly Detection
Bofei Zhang
BIGAI
Zihao Zhang
Tianjin University
Computer Vision
Xiaochuan Shi
DataCanvas
Zedong Yu
State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing University of Posts and Telecommunications
Yuwei Wu
Ph.D. candidate, GRASP Lab, University of Pennsylvania
Robotics, Trajectory Optimization, Task and Motion Planning
Xinxiao Wu
Beijing Institute of Technology, Shenzhen MSU-BIT University
Yunde Jia
Shenzhen MSU-BIT University
Liuyu Xiang
Beijing University of Posts and Telecommunications
Computer Vision, Reinforcement Learning, LLM Agent
Zhaofeng He
Beijing University of Posts and Telecommunications
Qing Li
State Key Laboratory of General Artificial Intelligence, BIGAI