🤖 AI Summary
To address core challenges in GUI agent development—including data scalability, multi-turn reinforcement learning (RL) instability, exclusive reliance on GUI interactions, and environmental volatility—this paper proposes a systematic training framework. We introduce a data flywheel to generate high-quality, large-scale GUI interaction trajectories; design a stable multi-turn RL algorithm; pioneer a hybrid GUI environment integrating filesystem and terminal access; and develop a unified sandbox platform enabling massively parallel rollouts. Our approach adopts a native, centralized end-to-end GUI architecture that couples environment simulation with distributed training. Evaluated on standard benchmarks—Online-Mind2Web (88.2), OSWorld (47.5), WindowsAgentArena (50.6), and AndroidWorld (73.3)—our method significantly outperforms state-of-the-art baselines. It achieves roughly 60% of human-level performance on game tasks and demonstrates strong generalization to long-horizon and software engineering scenarios.
📝 Abstract
The development of autonomous agents for graphical user interfaces (GUIs) presents major challenges in artificial intelligence. While recent advances in native agent models have shown promise by unifying perception, reasoning, action, and memory through end-to-end learning, open problems remain in data scalability, multi-turn reinforcement learning (RL), the limitations of GUI-only operation, and environment stability. In this technical report, we present UI-TARS-2, a native GUI-centered agent model that addresses these challenges through a systematic training methodology: a data flywheel for scalable data generation, a stabilized multi-turn RL framework, a hybrid GUI environment that integrates file systems and terminals, and a unified sandbox platform for large-scale rollouts. Empirical evaluation demonstrates that UI-TARS-2 achieves significant improvements over its predecessor UI-TARS-1.5. On GUI benchmarks, it reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld, outperforming strong baselines such as Claude and OpenAI agents. In game environments, it attains a mean normalized score of 59.8 across a 15-game suite (roughly 60% of human-level performance) and remains competitive with frontier proprietary models (e.g., OpenAI o3) on LMGame-Bench. Additionally, the model generalizes to long-horizon information-seeking tasks and software engineering benchmarks, highlighting its robustness across diverse agent tasks. Detailed analyses of training dynamics further provide insights into achieving stability and efficiency in large-scale agent RL. These results underscore UI-TARS-2's potential to advance the state of GUI agents and demonstrate strong generalization to real-world interactive scenarios.
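To make the rollout pattern behind the sandbox platform concrete, the sketch below shows a toy multi-turn agent-environment loop with parallel trajectory collection. All class, function, and action names here are hypothetical illustrations of the general pattern, not the paper's actual API; a real system would replace the stub policy with model inference and the toy environment with a sandboxed GUI/terminal instance.

```python
import concurrent.futures
import random


class SandboxEnv:
    """Toy stand-in for a sandboxed hybrid GUI environment (hypothetical)."""

    def __init__(self, seed: int):
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return {"screen": "initial", "step": self.t}

    def step(self, action: str):
        # Each action advances the episode; a sparse terminal reward
        # mimics task-completion signals used in multi-turn RL.
        self.t += 1
        done = self.t >= 5
        reward = 1.0 if done else 0.0
        return {"screen": f"after {action}", "step": self.t}, reward, done


def policy(observation) -> str:
    """Placeholder agent policy; a real agent would run model inference here."""
    return f"click@{observation['step']}"


def rollout(seed: int):
    """Collect one multi-turn trajectory as a list of (action, reward) pairs."""
    env = SandboxEnv(seed)
    obs, trajectory, done = env.reset(), [], False
    while not done:
        action = policy(obs)
        obs, reward, done = env.step(action)
        trajectory.append((action, reward))
    return trajectory


# Parallel rollouts: one worker per independent sandboxed environment.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    trajectories = list(pool.map(rollout, range(8)))

print(len(trajectories), len(trajectories[0]))  # 8 trajectories, 5 turns each
```

The key design point the sketch illustrates is that each environment instance is isolated and seedable, so rollouts can be fanned out across many workers and the resulting trajectories fed back into training, which is what makes large-scale multi-turn RL practical.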