Efficient Multi-turn RL for GUI Agents via Decoupled Training and Adaptive Data Curation

📅 2025-09-28
🤖 AI Summary
To address the low reinforcement learning efficiency and scarcity of high-quality trajectory data for vision-language model (VLM)-driven GUI agents in multi-turn interactions, this paper proposes DART, a decoupled asynchronous training framework. DART fully decouples four core components—environment interaction, rollout generation, data management, and policy training—enabling non-blocking communication and fine-grained, rollout-level asynchronous coordination. It further introduces an adaptive data filtering mechanism integrating trajectory-level sampling, dynamic rollout length adjustment, entropy-weighted prioritization of high-uncertainty decisions, and truncated importance reweighting. Evaluated on the OSWorld benchmark, DART-GUI-7B achieves a 42.13% task success rate—improving over its base model by 14.61 percentage points and surpassing the open-source state of the art by 7.34 percentage points. The code, datasets, and models are fully open-sourced.
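The decoupled, non-blocking coordination the summary describes can be pictured as a producer/consumer pipeline. The sketch below is illustrative only (not the authors' implementation): rollout workers push finished trajectories into a shared buffer while the trainer consumes them asynchronously, so a slow environment never stalls the optimizer.

```python
# Minimal sketch of rollout-level asynchronous coordination using a
# bounded queue: rollout workers produce, the trainer consumes.
import queue
import threading

rollout_queue = queue.Queue(maxsize=8)  # buffers completed trajectories

def rollout_worker(worker_id, n_rollouts):
    # Stand-in for VLM policy rollout against a GUI environment.
    for step in range(n_rollouts):
        trajectory = {"worker": worker_id, "steps": [f"action_{step}"]}
        rollout_queue.put(trajectory)  # blocks only if the buffer is full

def trainer(n_expected, batch):
    # Consumes trajectories as they arrive, without waiting for every
    # environment worker to finish its batch.
    for _ in range(n_expected):
        batch.append(rollout_queue.get())
        rollout_queue.task_done()

collected = []
workers = [threading.Thread(target=rollout_worker, args=(i, 4)) for i in range(3)]
consumer = threading.Thread(target=trainer, args=(12, collected))
for t in workers:
    t.start()
consumer.start()
for t in workers:
    t.join()
consumer.join()
print(len(collected))  # 12 trajectories gathered without a global barrier
```

In the real system the buffer would sit between an environment cluster and a distributed trainer, with per-worker model synchronization; the queue here only conveys the non-blocking hand-off.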

📝 Abstract
Vision-language model (VLM) based GUI agents show promise for automating complex desktop and mobile tasks, but face significant challenges in applying reinforcement learning (RL): (1) slow multi-turn interactions with GUI environments for policy rollout, and (2) insufficient high-quality agent-environment interactions for policy learning. To address these challenges, we propose DART, a Decoupled Agentic RL Training framework for GUI agents, which coordinates heterogeneous modules in a highly decoupled manner. DART separates the training system into four asynchronous modules: environment cluster, rollout service, data manager, and trainer. This design enables non-blocking communication, asynchronous training, rollout-wise trajectory sampling, and per-worker model synchronization, significantly improving system efficiency: 1.6× GPU utilization for rollout, 1.9× training throughput, and 5.5× environment utilization. To facilitate effective learning from abundant samples, we introduce an adaptive data curation scheme: (1) pre-collecting successful trajectories for challenging tasks to supplement sparse success in online sampling; (2) dynamically adjusting rollout numbers and trajectory lengths based on task difficulty; (3) training selectively on high-entropy steps to prioritize critical decisions; (4) stabilizing learning via truncated importance sampling to correct the policy mismatch between rollout and update. On the OSWorld benchmark, DART-GUI-7B achieves a 42.13% task success rate, a 14.61% absolute gain over the base model, and 7.34% higher than open-source SOTA. We will fully open-source our training framework, data, and model checkpoints via computer-use-agents.github.io/dart-gui, which we believe is a timely contribution to the open-source community of agentic RL training.
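Item (4) of the curation scheme corrects for the fact that trajectories are generated by a stale policy while gradients are computed under the updated one. A hedged sketch of truncated importance reweighting follows; the clip threshold `c` and the surrogate-loss form are illustrative assumptions, not the paper's exact estimator.

```python
# Truncated importance sampling: the likelihood ratio between the
# updated policy and the (stale) rollout policy is clipped at c to
# bound the variance of off-policy gradient estimates.
import math

def truncated_is_weight(logp_new, logp_old, c=2.0):
    # Importance ratio pi_new / pi_old, truncated at c.
    ratio = math.exp(logp_new - logp_old)
    return min(ratio, c)

def weighted_loss(advantages, logps_new, logps_old, c=2.0):
    # Policy-gradient-style surrogate: each step's advantage is scaled
    # by its truncated importance weight, then averaged.
    weights = [truncated_is_weight(n, o, c) for n, o in zip(logps_new, logps_old)]
    return -sum(w * a for w, a in zip(weights, advantages)) / len(advantages)

# A ratio of 5.0 would be clipped down to c = 2.0:
print(truncated_is_weight(math.log(5.0), 0.0))  # 2.0
```

Truncation (a hard min) differs from PPO-style clipping in that it bounds the weight itself rather than the surrogate objective; either way, the goal is stable updates despite rollout/update mismatch.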
Problem

Research questions and friction points this paper is trying to address.

Addressing slow multi-turn GUI interactions for RL training
Solving insufficient high-quality agent-environment interaction data
Improving reinforcement learning efficiency for vision-language GUI agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoupled framework enables asynchronous multi-module training
Adaptive data curation optimizes trajectory sampling and learning
Truncated importance sampling stabilizes policy mismatch correction
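The entropy-based step prioritization mentioned above can be sketched as follows. This is an assumed top-k formulation for illustration; the paper describes entropy-weighted prioritization of high-uncertainty decisions without committing to this exact rule.

```python
# Select the most uncertain decision steps in a trajectory by ranking
# per-step policy entropy and keeping the top-k for the policy update.
import math

def step_entropy(probs):
    # Shannon entropy (in nats) of one step's action distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_high_entropy_steps(step_dists, k):
    # Rank steps by entropy, return the indices of the k most uncertain
    # decisions -- the "critical" steps worth spending gradient on.
    ents = [(step_entropy(d), i) for i, d in enumerate(step_dists)]
    ents.sort(reverse=True)
    return sorted(i for _, i in ents[:k])

dists = [
    [0.97, 0.01, 0.01, 0.01],  # confident step, low entropy
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain step
    [0.60, 0.20, 0.10, 0.10],  # moderately uncertain step
]
print(select_high_entropy_steps(dists, 2))  # [1, 2]
```

The intuition: steps where the policy is nearly deterministic contribute little learning signal, so filtering to high-entropy steps concentrates updates on genuinely ambiguous GUI decisions.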
Pengxiang Li
Beijing Institute of Technology
Multimodal Agent, Vision and Language, 3DV, Hyperbolic Learning
Zechen Hu
George Mason University
Reinforcement learning
Zirui Shang
Beijing Institute of Technology, State Key Laboratory of General Artificial Intelligence, BIGAI
Jingrong Wu
DataCanvas
Yang Liu
State Key Laboratory of General Artificial Intelligence, BIGAI
Hui Liu
DataCanvas
Zhi Gao
Beijing Institute of Technology, State Key Laboratory of General Artificial Intelligence, BIGAI
Chenrui Shi
Beijing Institute of Technology
Anomaly Detection
Bofei Zhang
BIGAI
Zihao Zhang
Tianjin University
Computer Vision
Xiaochuan Shi
DataCanvas
Zedong Yu
State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing University of Posts and Telecommunications
Yuwei Wu
Ph.D. candidate, GRASP Lab, University of Pennsylvania
Robotics, Trajectory Optimization, Task and Motion Planning
Xinxiao Wu
Beijing Institute of Technology, Shenzhen MSU-BIT University
Yunde Jia
Shenzhen MSU-BIT University
Liuyu Xiang
Beijing University of Posts and Telecommunications
Computer Vision, Reinforcement Learning, LLM Agent
Zhaofeng He
Beijing University of Posts and Telecommunications
Qing Li
State Key Laboratory of General Artificial Intelligence, BIGAI