UIPro: Unleashing Superior Interaction Capability For GUI Agents

📅 2025-09-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing GUI agents suffer from closed-domain assumptions, scarce training data, and heterogeneous action spaces. Method: We propose a general-purpose GUI agent framework featuring: (1) a unified, generalizable discrete action space enabling cross-platform and cross-task operation; (2) a large-scale, multi-platform GUI understanding dataset comprising 20.6 million samples to enhance joint visual–interface–action representation; and (3) an end-to-end learning paradigm combining large-scale vision-language model pretraining with task-adaptive fine-tuning. Results: Our agent achieves significant improvements over state-of-the-art methods across multiple GUI benchmarks—including Android and desktop environments—demonstrating strong generalization capability and robust interactive performance. The framework establishes a scalable technical pathway toward truly general-purpose GUI agents.
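The summary's central idea, a unified discrete action space that harmonizes platform-specific events, can be sketched as follows. This is an illustrative assumption, not the paper's published schema: the action names, fields, and `harmonize` mappings are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical unified action vocabulary; the paper does not publish its
# exact schema, so the names and fields below are illustrative assumptions.
UNIFIED_ACTIONS = {"click", "long_press", "type", "scroll", "navigate_back", "complete"}

@dataclass
class GUIAction:
    """One platform-agnostic action a GUI agent can emit."""
    name: str
    x: Optional[float] = None        # normalized [0, 1] screen coordinates
    y: Optional[float] = None
    text: Optional[str] = None       # payload for "type"
    direction: Optional[str] = None  # payload for "scroll"

    def __post_init__(self):
        if self.name not in UNIFIED_ACTIONS:
            raise ValueError(f"unknown action: {self.name}")

def harmonize(platform: str, raw: dict) -> GUIAction:
    """Map a platform-specific event onto the unified space.
    Only two toy mappings are shown."""
    if platform == "android" and raw["type"] == "tap":
        return GUIAction("click", x=raw["x"], y=raw["y"])
    if platform == "web" and raw["type"] == "input":
        return GUIAction("type", text=raw["value"])
    raise NotImplementedError(platform)

# An Android tap and a web text input collapse to one schema,
# so heterogeneous task datasets can be merged for fine-tuning.
a = harmonize("android", {"type": "tap", "x": 0.42, "y": 0.17})
b = harmonize("web", {"type": "input", "value": "hello"})
```

The point of such a mapping is that trajectories collected on different platforms become interchangeable training samples once their actions share one vocabulary.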

📝 Abstract
Building autonomous agents that perceive and operate graphical user interfaces (GUIs) like humans has long been a vision in the field of artificial intelligence. Central to these agents is the capability for GUI interaction, which involves GUI understanding and planning capabilities. Existing methods have tried developing GUI agents based on the multi-modal comprehension ability of vision-language models (VLMs). However, limited scenarios, insufficient data scale, and heterogeneous action spaces hinder the progress of building generalist GUI agents. To resolve these issues, this paper proposes UIPro, a novel generalist GUI agent trained with extensive multi-platform and multi-task GUI interaction data, coupled with a unified action space. We first curate a comprehensive dataset encompassing 20.6 million GUI understanding tasks to pre-train UIPro, granting it a strong GUI grounding capability, which is key to downstream GUI agent tasks. Subsequently, we establish a unified action space to harmonize heterogeneous GUI agent task datasets and produce a merged dataset to foster the action prediction ability of UIPro via continued fine-tuning. Experimental results demonstrate UIPro's superior performance across multiple GUI task benchmarks on various platforms, highlighting the effectiveness of our approach.
Problem

Research questions and friction points this paper is trying to address.

Building autonomous agents that perceive and operate GUIs like humans
Overcoming limitations of existing GUI agents with heterogeneous action spaces
Developing generalist GUI agents with unified action space across platforms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training on extensive multi-platform GUI interaction data
Unified action space harmonizing heterogeneous datasets
GUI grounding capability via pre-training on understanding tasks
Hongxin Li
University of Chinese Academy of Sciences (UCAS), New Laboratory of Pattern Recognition (NLPR), CASIA, State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), CASIA, StepFun
Jingran Su
PolyU
Jingfan Chen
The Hong Kong Polytechnic University
Agent, Large Language Model, Graph Neural Networks, Recommender Systems
Zheng Ju
University of Chinese Academy of Sciences (UCAS), New Laboratory of Pattern Recognition (NLPR), CASIA, State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), CASIA
Yuntao Chen
Miromind
Agentic AI, Multimodal Model, Computer Vision
Qing Li
PolyU
Zhaoxiang Zhang
Institute of Automation, Chinese Academy of Sciences
Computer Vision, Pattern Recognition, Biologically-inspired Learning