MagicGUI: A Foundational Mobile GUI Agent with Scalable Data Pipeline and Reinforcement Fine-tuning

📅 2025-07-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address critical challenges in real-world mobile GUI environments (perceptual ambiguity, inaccurate element localization, and weak reasoning), this paper proposes MagicGUI, a general-purpose mobile GUI agent framework. Methodologically, it introduces fine-grained vision-semantics multimodal alignment and a unified action space, supported by planning-oriented reasoning with explicit intermediate meta-plans; it applies reinforcement fine-tuning with a spatially enhanced composite reward and a dual filtering strategy; and it builds a scalable data pipeline that combines open-source repositories, automated crawling, and targeted manual annotation to feed continual pre-training. Evaluated on the proprietary Magic-RICH benchmark and more than a dozen public benchmarks, the method reports competitive to superior performance on GUI perception and agent tasks, along with robust generalization and real-world deployment potential in practical mobile GUI scenarios.

📝 Abstract
This paper presents MagicGUI, a foundational mobile GUI agent designed to address critical challenges in perception, grounding, and reasoning within real-world mobile GUI environments. The framework is underpinned by the following six key components: (1) a comprehensive and accurate dataset, constructed via the scalable GUI Data Pipeline, which aggregates the largest and most diverse GUI-centric multimodal data to date from open-source repositories, automated crawling, and targeted manual annotation; (2) enhanced perception and grounding capabilities, facilitating fine-grained multimodal alignment for UI element referencing, grounding, and screen comprehension; (3) a comprehensive and unified action space, encompassing both fundamental UI operations and complex interactive intents to support human-agent interactions; (4) planning-oriented reasoning mechanisms that enable the model to decompose complex user instructions into sequential actions with explicit intermediate meta-plan reasoning; (5) an iterative two-stage training procedure, combining large-scale continual pre-training on 7.8M samples with reinforcement fine-tuning that uses a spatially enhanced composite reward and a dual filtering strategy; and (6) competitive performance on both the proprietary Magic-RICH benchmark and over a dozen public benchmarks, achieving superior performance across GUI perception and agent tasks, while demonstrating robust generalization and real-world deployment potential in practical mobile GUI scenarios, as detailed in Figure 1.
Problem

Research questions and friction points this paper is trying to address.

Develops a mobile GUI agent for perception and reasoning challenges
Creates a scalable data pipeline for diverse GUI-centric data
Enhances UI interaction with unified action space and planning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scalable GUI Data Pipeline for diverse multimodal data
Reinforcement fine-tuning with composite reward strategy
Planning-oriented reasoning for sequential action decomposition
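
The abstract names a "spatially enhanced composite reward" for reinforcement fine-tuning, but this page does not give its formula. The sketch below is therefore only one plausible shape for such a reward, not the authors' definition: it adds a format term, an action-type term, and a spatial term that is 1 inside the target element and decays with distance outside it. The `Action` class, the function names, the weights, and `sigma` are all hypothetical choices made for illustration.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom), normalized to [0, 1]


@dataclass
class Action:
    """Hypothetical predicted action; the paper's real action schema is not shown here."""
    kind: str                  # e.g. "tap", "scroll", "type" (illustrative names)
    x: Optional[float] = None  # normalized screen coordinates, if the action has a target
    y: Optional[float] = None


def spatial_reward(pred: Action, box: Box, sigma: float = 0.1) -> float:
    """Return 1.0 inside the target box, with Gaussian decay by distance outside it."""
    left, top, right, bottom = box
    dx = max(left - pred.x, 0.0, pred.x - right)   # horizontal distance to the box
    dy = max(top - pred.y, 0.0, pred.y - bottom)   # vertical distance to the box
    dist = math.hypot(dx, dy)
    return math.exp(-(dist ** 2) / (2 * sigma ** 2))


def composite_reward(
    pred: Action,
    target_kind: str,
    target_box: Optional[Box],
    well_formed: bool,
    weights: Tuple[float, float, float] = (0.1, 0.4, 0.5),  # assumed, not from the paper
) -> float:
    """Combine format, action-type, and spatial terms into one scalar reward."""
    w_format, w_kind, w_spatial = weights
    if not well_formed:
        return 0.0                      # unparseable output earns nothing
    reward = w_format
    if pred.kind != target_kind:
        return reward                   # wrong action type: format credit only
    reward += w_kind
    if target_box is None or pred.x is None or pred.y is None:
        return reward + w_spatial       # nothing to localize: grant the spatial term
    return reward + w_spatial * spatial_reward(pred, target_box)
```

A tap inside the target box earns the full reward, while a near miss still earns partial spatial credit. That graded signal is the usual motivation for a distance-aware term over a binary hit-or-miss check.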
Authors
Liujian Tang
Honor Device Co., Ltd
Shaokang Dong
Honor Device Co., Ltd
Multi-agent RL, RLHF, LLM Agent
Yijia Huang
Honor Device Co., Ltd
Minqi Xiang
Honor Device Co., Ltd
Hongtao Ruan
Honor Device Co., Ltd
Bin Wang
Honor Device Co., Ltd
Shuo Li
Fudan University
Zhiheng Xi
Fudan University
LLM Reasoning, LLM-based Agents
Zhihui Cao
Honor Device Co., Ltd
Hailiang Pang
Honor Device Co., Ltd
Heng Kong
Honor Device Co., Ltd
He Yang
Xi'an Jiaotong University
Federated Learning, Deep Learning, Privacy & Security
Mingxu Chai
Fudan University
Zhilin Gao
Honor Device Co., Ltd
Xingyu Liu
Honor Device Co., Ltd
Yingnan Fu
Honor Device Co., Ltd
Jiaming Liu
Honor Device Co., Ltd
Tao Gui
Fudan University
Xuanjing Huang
Fudan University
Yu-Gang Jiang
Professor, Fudan University. IEEE & IAPR Fellow
Video Analysis, Embodied AI, Trustworthy AI
Qi Zhang
Fudan University
Kang Wang
Honor Device Co., Ltd
Yunke Zhang
Honor Device Co., Ltd
Yuran Wang
Honor Device Co., Ltd