AgentCPM-GUI: Building Mobile-Use Agents with Reinforcement Fine-Tuning

📅 2025-06-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing GUI agents for mobile platforms suffer from noisy training data, insufficient semantic diversity, weak cross-lingual generalization, and difficulties with on-device deployment, particularly within the Chinese mobile ecosystem. To address these issues, this paper introduces AgentCPM-GUI, an 8B-parameter on-device GUI agent that covers Chinese as well as English mobile environments. Methodologically, the authors (1) propose grounding-aware pre-training to enhance UI understanding and action grounding; (2) construct a high-quality bilingual (Chinese and English) trajectory dataset and apply supervised fine-tuning on it; and (3) apply GRPO-based reinforcement fine-tuning together with a compact action space to improve generalization to unfamiliar scenarios and reasoning. The agent achieves state-of-the-art performance on five public benchmarks and on the newly established Chinese benchmark CAGUI (Type-Match: 96.9%, Exact-Match: 91.3%), while supporting low-latency on-device inference. The code, model checkpoints, and evaluation data are publicly released.
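To illustrate why a compact action space shortens model output, the following minimal sketch compares two serializations of the same tap action. Both schemas, including the key names `action_type`, `coordinate`, `description`, and `tap`, are hypothetical and are not taken from the paper; character count is used here only as a rough proxy for the number of tokens the model would have to generate.

```python
import json

# Hypothetical verbose encoding of a tap action (illustrative only,
# not the paper's schema).
verbose_action = {
    "action_type": "click",
    "coordinate": {"x": 512, "y": 384},
    "description": "tap the search button",
}

# Hypothetical compact encoding: one short key, positional arguments,
# no free-text fields.
compact_action = {"tap": [512, 384]}


def encode(action: dict) -> str:
    """Serialize an action without optional whitespace."""
    return json.dumps(action, separators=(",", ":"))


verbose_len = len(encode(verbose_action))
compact_len = len(encode(compact_action))

print(encode(verbose_action), verbose_len)
print(encode(compact_action), compact_len)
```

Because decoding cost on a phone grows with the number of generated tokens, a shorter action string translates fairly directly into lower per-step latency, which is the motivation the summary gives for the compact design.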

📝 Abstract
Recent progress in large language model agents has opened new possibilities for automating tasks through graphical user interfaces (GUIs), especially in mobile environments where intelligent interaction can greatly enhance usability. However, practical deployment of such agents remains constrained by several key challenges. Existing training data is often noisy and lacks semantic diversity, which hinders the learning of precise grounding and planning. Models trained purely by imitation tend to overfit to seen interface patterns and fail to generalize to unfamiliar scenarios. Moreover, most prior work focuses on English interfaces and overlooks the growing diversity of non-English applications, such as those in the Chinese mobile ecosystem. In this work, we present AgentCPM-GUI, an 8B-parameter GUI agent built for robust and efficient on-device GUI interaction. Our training pipeline includes grounding-aware pre-training to enhance perception, supervised fine-tuning on high-quality Chinese and English trajectories to imitate human-like actions, and reinforcement fine-tuning with GRPO to improve reasoning capability. We also introduce a compact action space that reduces output length and supports low-latency execution on mobile devices. AgentCPM-GUI achieves state-of-the-art performance on five public benchmarks and on a new Chinese GUI benchmark called CAGUI, where it reaches 96.9% Type-Match and 91.3% Exact-Match. To facilitate reproducibility and further research, we publicly release all code, model checkpoints, and evaluation data.
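The two reported metrics can be sketched as follows. This is a minimal illustration under stated assumptions: Type-Match is taken to mean that the predicted action type equals the reference type, and Exact-Match that the type and all arguments agree. The `Action` class and the sample trajectories are hypothetical, and the actual benchmark may apply tolerances (for example, for tap coordinates) that this sketch omits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """Hypothetical step-level action: a type plus positional arguments."""
    kind: str
    args: tuple = ()


def type_match(pred: Action, gold: Action) -> bool:
    # Only the action type has to agree.
    return pred.kind == gold.kind


def exact_match(pred: Action, gold: Action) -> bool:
    # Type and every argument have to agree (strict equality, no tolerance).
    return pred == gold


def score(preds: list, golds: list) -> tuple:
    """Return (type-match rate, exact-match rate) over aligned steps."""
    if len(preds) != len(golds) or not golds:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(golds)
    tm = sum(type_match(p, g) for p, g in zip(preds, golds)) / n
    em = sum(exact_match(p, g) for p, g in zip(preds, golds)) / n
    return tm, em


# Made-up example data, not results from the paper.
preds = [Action("tap", (1, 2)), Action("tap", (3, 4)),
         Action("type", ("hi",)), Action("swipe", ("up",))]
golds = [Action("tap", (1, 2)), Action("tap", (9, 9)),
         Action("type", ("hi",)), Action("back")]

tm, em = score(preds, golds)
print(f"Type-Match: {tm:.1%}, Exact-Match: {em:.1%}")
```

Under these definitions Exact-Match can never exceed Type-Match, which is consistent with the reported pair of 96.9% and 91.3%.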
Problem

Research questions and friction points this paper is trying to address.

Noisy and semantically limited training data hinders precise GUI interaction learning
Models overfit to seen interfaces and fail in unfamiliar scenarios
Prior work neglects non-English interfaces, such as those in the Chinese mobile ecosystem
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement fine-tuning with GRPO
Grounding-aware pre-training for perception
Compact action space for mobile efficiency
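The core idea of GRPO (Group Relative Policy Optimization) is that each sampled response is scored relative to the other responses sampled for the same prompt, instead of against a learned value function. The sketch below shows only that normalization step. The reward values are made up, the function name and the `eps` parameter are illustrative, and the use of the population standard deviation is an assumption; the paper's reward design and the rest of the policy update are not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list, eps: float = 1e-8) -> list:
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so samples
    that beat their siblings get a positive advantage and the rest negative.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled actions on one screenshot and task.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)

# If every sample earns the same reward, the group carries no learning
# signal and all advantages are zero.
flat = group_relative_advantages([0.5, 0.5, 0.5])
print(flat)
```

One consequence of this design is that no separate critic model has to be trained or stored, which keeps the memory footprint of reinforcement fine-tuning closer to that of supervised fine-tuning.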
Authors

Zhong Zhang
Tsinghua University
Large Language Models, LLM Agents, Natural Language Processing

Yaxi Lu
Tsinghua University

Yikun Fu
Beijing Institute of Technology

Yupeng Huo
Renmin University of China

Shenzhi Yang
Zhejiang University
machine learning, learning theory, large language models

Yesai Wu
Tsinghua University, ModelBest Inc., Huazhong University of Science and Technology
Autonomous Agent, Tool Learning, Large Language Model

Han Si
Tsinghua University

Xin Cong
Tsinghua University
Tool Learning, Autonomous Agent, Large Language Model, Knowledge Graph

Haotian Chen
University of California, Los Angeles
Political Economy, Non-market Strategy, American Politics

Yankai Lin
Associate Professor (Tenure Track), Gaoling School of AI, Renmin University of China
Natural Language Processing, Large Language Models

Jie Xie
Tsinghua University

Wei Zhou
Tsinghua University

Wang Xu
Harbin Institute of Technology
natural language processing, artificial intelligence

Yuanheng Zhang
Tsinghua University

Zhou Su
Xi'an Jiaotong University

Zhongwu Zhai
ModelBest Inc.

Xiaoming Liu
ModelBest Inc.

Yudong Mei
ModelBest Inc.

Jianming Xu
ModelBest Inc.

Hongyan Tian
ModelBest Inc.

Chongyi Wang
ModelBest Inc.

Chi Chen
Tsinghua University

Yuan Yao
Tsinghua University

Zhiyuan Liu
Tsinghua University

Maosong Sun
Professor of Computer Science and Technology, Tsinghua University
Natural Language Processing, Artificial Intelligence, Social Computing