UI-TARS: Pioneering Automated GUI Interaction with Native Agents

📅 2025-01-21
📈 Citations: 0 · Influential citations: 0
🤖 AI Summary
This paper addresses core challenges in GUI agents, namely limited cross-platform generalization, imprecise action grounding, and weak long-horizon planning, by proposing UI-TARS, an end-to-end native GUI agent. UI-TARS takes raw screen screenshots as input and directly outputs pixel-level keyboard/mouse actions, eliminating reliance on wrapped commercial LLMs and hand-engineered prompts. Methodologically, it integrates enhanced visual perception, unified cross-platform action modeling, System-2 multi-step reasoning (including task decomposition, milestone identification, and reflective refinement), and an iterative training paradigm grounded in online trajectory reflection: high-quality cross-platform GUI trajectories are automatically collected, filtered, and optimized on a virtual-machine cluster, and a standardized action space and a large-scale GUI screenshot dataset are constructed. UI-TARS achieves state-of-the-art performance across 10+ benchmarks, including OSWorld (24.6 at 50 steps) and AndroidWorld (46.6), outperforming GPT-4o, Claude, and other SOTA methods.
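To make the "screenshots in, pixel-level actions out" loop concrete, here is a minimal sketch of what such a native agent's control loop could look like. The `query_ui_tars` function and its action schema are hypothetical placeholders (the paper does not specify a client API), and `pyautogui` is assumed as the keyboard/mouse executor.

```python
import pyautogui  # assumed executor for keyboard/mouse actions


def query_ui_tars(screenshot_png: bytes, instruction: str, history: list[str]) -> dict:
    """Hypothetical stand-in for the UI-TARS model call: given the current
    screenshot and the task instruction, return a pixel-level action such as
    {"type": "click", "x": 512, "y": 384} or {"type": "type", "text": "hi"}."""
    raise NotImplementedError("replace with a real model endpoint")


def run_episode(instruction: str, max_steps: int = 50) -> None:
    """Screenshot in, pixel-level action out: no accessibility tree, no DOM."""
    history: list[str] = []
    for _ in range(max_steps):
        shot = pyautogui.screenshot()  # raw pixels are the only perception input
        action = query_ui_tars(shot.tobytes(), instruction, history)
        if action["type"] == "click":
            pyautogui.click(action["x"], action["y"])  # pixel-level grounding
        elif action["type"] == "type":
            pyautogui.write(action["text"])            # keyboard input
        elif action["type"] == "finished":
            break                                      # task complete
        history.append(str(action))
```

The 50-step cap mirrors the step budget used in the OSWorld numbers quoted above.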

📝 Abstract
This paper introduces UI-TARS, a native GUI agent model that perceives only screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations). Unlike prevailing agent frameworks that depend on heavily wrapped commercial models (e.g., GPT-4o) with expert-crafted prompts and workflows, UI-TARS is an end-to-end model that outperforms these sophisticated frameworks. Experiments demonstrate its superior performance: UI-TARS achieves SOTA performance in 10+ GUI agent benchmarks evaluating perception, grounding, and GUI task execution. Notably, in the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9, respectively). In AndroidWorld, UI-TARS achieves 46.6, surpassing GPT-4o (34.5). UI-TARS incorporates several key innovations: (1) Enhanced Perception: leveraging a large-scale dataset of GUI screenshots for context-aware understanding of UI elements and precise captioning; (2) Unified Action Modeling, which standardizes actions into a unified space across platforms and achieves precise grounding and interaction through large-scale action traces; (3) System-2 Reasoning, which incorporates deliberate reasoning into multi-step decision making, involving multiple reasoning patterns such as task decomposition, reflection thinking, milestone recognition, etc.; and (4) Iterative Training with Reflective Online Traces, which addresses the data bottleneck by automatically collecting, filtering, and reflectively refining new interaction traces on hundreds of virtual machines. Through iterative training and reflection tuning, UI-TARS continuously learns from its mistakes and adapts to unforeseen situations with minimal human intervention. We also analyze the evolution path of GUI agents to guide the further development of this domain.
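As a concrete illustration of innovation (3), the sketch below parses a deliberate-reasoning response into a thought trace plus a grounded action. The `Thought:`/`Action:` layout and the `click(x=..., y=...)` syntax are illustrative assumptions; the paper's exact serialization may differ.

```python
import re

# Example response in an assumed Thought/Action serialization.
RESPONSE = """Thought: The task is to open Settings. The gear icon is visible
in the taskbar, so clicking it is the next milestone.
Action: click(x=742, y=1051)"""


def parse_response(text: str) -> tuple[str, str, dict]:
    """Split a response into (thought, action name, action arguments)."""
    thought = re.search(r"Thought:\s*(.*?)\s*Action:", text, re.S).group(1)
    m = re.search(r"Action:\s*(\w+)\((.*)\)", text)
    name, args_str = m.group(1), m.group(2)
    args = {}
    for kv in filter(None, re.split(r",\s*", args_str)):
        k, v = kv.split("=", 1)
        args[k] = int(v) if v.isdigit() else v.strip("'\"")
    return thought, name, args


thought, name, args = parse_response(RESPONSE)
print(name, args)  # click {'x': 742, 'y': 1051}
```

Keeping the thought separate from the action lets an executor act on the grounded coordinates while the reasoning trace feeds back into the model's history for the next step.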
Problem

Research questions and friction points this paper is trying to address.

Automated GUI Operation
Cross-platform Capability
Self-learning Optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Autonomous Screen Recognition
Cross-Platform Precision
Human-like Decision Making
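The "Self-learning Optimization" theme above corresponds to innovation (4) in the abstract: the iterative collect-filter-reflect-finetune loop. A minimal sketch of that loop follows; all four callables are hypothetical placeholders for components the paper describes only at a high level.

```python
from typing import Callable, List, Tuple

# One trace = a list of (screenshot_id, action, succeeded) steps.
Trace = List[Tuple[str, str, bool]]


def iterative_training(
    rollout: Callable[[str], Trace],            # run one task on a VM
    passes_filters: Callable[[Trace], bool],    # rule/quality filtering
    reflect_and_fix: Callable[[Trace], Trace],  # annotate errors with corrections
    finetune: Callable[[List[Trace]], None],    # update the policy
    tasks: List[str],
    rounds: int = 3,
) -> None:
    """Collect traces with the current policy, keep the good ones, repair the
    bad ones via reflection, retrain, and repeat with the improved policy."""
    for _ in range(rounds):
        traces = [rollout(task) for task in tasks]
        kept = [t for t in traces if passes_filters(t)]
        repaired = [reflect_and_fix(t) for t in traces if not passes_filters(t)]
        finetune(kept + repaired)
```

The key design point is that failed traces are not discarded: reflection turns mistakes into corrective training signal, which is what lets the agent improve with minimal human intervention.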
👥 Authors

Yujia Qin (ByteDance) · Agent
Yining Ye (Tsinghua University, ByteDance) · Tool Learning, Agent, Unified-LM
Junjie Fang (ByteDance Seed)
Haoming Wang (University of Pittsburgh) · Federated learning
Shihao Liang (ByteDance) · Multimodal Agent, Agent Evaluation
Shizuo Tian (Tsinghua University)
Junda Zhang (ByteDance Seed)
Jiahao Li (ByteDance Seed)
Yunxin Li (ByteDance Seed)
Shijue Huang (Hong Kong University of Science and Technology) · Large Language Models, Reasoning, Agent
Wanjun Zhong (ByteDance Seed Research) · NLP
Kuanye Li (ByteDance Seed)
Jiale Yang (ByteDance Seed)
Yu Miao (ByteDance Seed)
Woyu Lin (ByteDance Seed)
Longxiang Liu (ByteDance Seed)
Xu Jiang (Duke University) · Information economics, accounting standard setting, real effects, disclosure, financial institutions
Qianli Ma (ByteDance Seed)
Jingyu Li (ByteDance Seed)
Xiaojun Xiao (ByteDance Seed)
Kai Cai (Osaka Metropolitan University) · Systems Control, Multi-Agent Systems, Robotic Networks, Discrete-Event Systems, Cyber-Physical Systems
Chuang Li (University of Science and Technology of China) · Stimuli-responsive hydrogels, Dynamic soft materials, Molecular photoswitches, Photoactuators, Supramolecular DNA hydrogel
Yaowei Zheng (Ph.D. student, Beihang University) · Machine Learning, Natural Language Processing
Chaolin Jin (ByteDance Seed)
Chen Li (ByteDance Seed)
Xiao Zhou (M.Phil. student, HKUST) · Autonomous Driving, DRL
Minchao Wang (ByteDance Seed)
Haoli Chen (ByteDance Seed)
Zhaojian Li (Red Cedar Distinguished Associate Professor, Michigan State University) · Controls, Learning, Robotics, Connected Vehicles, Smart Agriculture
Haihua Yang (ByteDance Seed)
Haifeng Liu (Zhejiang University) · Machine Learning, Data Management, Information Retrieval
Feng Lin (ByteDance Seed)
Tao Peng (Jilin University) · natural language processing, knowledge graph
Xin Liu (ByteDance Seed)
Guang Shi (ByteDance Seed)