🤖 AI Summary
To address critical challenges in real-world mobile GUI environments, including perceptual ambiguity, inaccurate element localization, and weak reasoning, this paper proposes a general-purpose mobile GUI agent framework. Methodologically, it introduces a fine-grained vision–semantics multimodal alignment mechanism and a unified discrete action space, underpinned by a meta-plan reasoning architecture; it incorporates a spatially enhanced composite reward function and a dual-filter reinforcement learning fine-tuning strategy; and it establishes an automated data pipeline for crawling, annotation, and continual pretraining. Evaluated on the proprietary Magic-RICH benchmark and more than a dozen public GUI benchmarks, the proposed method achieves state-of-the-art performance across the board, improves generalization and robustness on complex, dynamic interfaces, and enhances practical deployability on mobile devices. This work establishes a scalable, end-to-end technical paradigm for embodied intelligent agents in mobile GUI settings.
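The unified discrete action space is only named above, not specified. As a rough illustration of what such a schema might look like, the Python sketch below defines a hypothetical action enum spanning both basic UI operations and higher-level interactive intents; all type names and fields are assumptions for illustration, not MagicGUI's actual interface.

```python
# Hypothetical sketch of a unified discrete action space for a mobile GUI
# agent. Action names and fields are illustrative assumptions, not the
# schema actually used by MagicGUI.
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class ActionType(Enum):
    # Fundamental UI operations
    TAP = "tap"
    LONG_PRESS = "long_press"
    SWIPE = "swipe"
    TYPE_TEXT = "type_text"
    PRESS_BACK = "press_back"
    PRESS_HOME = "press_home"
    # Complex interactive intents supporting human-agent interaction
    ASK_USER = "ask_user"            # request clarification from the user
    TASK_COMPLETE = "task_complete"
    TASK_INFEASIBLE = "task_infeasible"


@dataclass
class Action:
    """One agent step: a discrete action type plus optional spatial
    and textual arguments."""
    type: ActionType
    point: Optional[Tuple[int, int]] = None   # screen coords for TAP / LONG_PRESS
    direction: Optional[str] = None           # "up" / "down" / ... for SWIPE
    text: Optional[str] = None                # payload for TYPE_TEXT / ASK_USER


# Example: two decoded steps for "tap the search box, then type a query"
step1 = Action(ActionType.TAP, point=(540, 180))
step2 = Action(ActionType.TYPE_TEXT, text="weather tomorrow")
```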
📝 Abstract
This paper presents MagicGUI, a foundational mobile GUI agent designed to address critical challenges in perception, grounding, and reasoning within real-world mobile GUI environments. The framework is underpinned by the following six key components: (1) a comprehensive and accurate dataset, constructed via the scalable GUI Data Pipeline, which aggregates the largest and most diverse GUI-centric multimodal data to date from open-source repositories, automated crawling, and targeted manual annotation; (2) enhanced perception and grounding capabilities, facilitating fine-grained multimodal alignment for UI element referencing, grounding, and screen comprehension; (3) a comprehensive and unified action space, encompassing both fundamental UI operations and complex interactive intents to support human-agent interaction; (4) planning-oriented reasoning mechanisms that enable the model to decompose complex user instructions into sequential actions with explicit intermediate meta-plan reasoning; (5) an iterative two-stage training procedure, combining large-scale continued pre-training on 7.8M samples with reinforcement fine-tuning that uses a spatially enhanced composite reward and a dual filtering strategy; and (6) competitive performance on both the proprietary Magic-RICH benchmark and over a dozen public benchmarks, with superior results across GUI perception and agent tasks and robust generalization and real-world deployment potential in practical mobile GUI scenarios, as detailed in Figure 1.
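The abstract names a spatially enhanced composite reward and a dual filtering strategy without giving formulas. The minimal Python sketch below shows one plausible reading: a weighted sum of format, action-type, and distance-decaying spatial terms, plus a filter that discards rollout groups whose rewards are uniformly high or uniformly low. The weights, decay rate, and thresholds are invented for illustration and are not the paper's actual formulation.

```python
# Hypothetical sketch of a spatially enhanced composite reward and a dual
# filter for reinforcement fine-tuning. All constants are assumptions.
import math
from typing import List, Optional, Tuple


def composite_reward(
    pred_type: str,
    gold_type: str,
    pred_point: Optional[Tuple[float, float]],
    gold_point: Optional[Tuple[float, float]],
    format_ok: bool,
) -> float:
    """Combine a format check, an action-type match, and a spatial term
    that decays smoothly with distance from the gold coordinate
    (coordinates assumed normalized to [0, 1])."""
    r_format = 1.0 if format_ok else 0.0
    r_type = 1.0 if pred_type == gold_type else 0.0
    if pred_point is not None and gold_point is not None:
        dist = math.dist(pred_point, gold_point)
        r_space = math.exp(-5.0 * dist)  # assumed decay rate
    else:
        r_space = r_type  # actions without coordinates fall back to type match
    return 0.2 * r_format + 0.4 * r_type + 0.4 * r_space


def dual_filter(samples: List[dict], low: float = 0.1, high: float = 0.9) -> List[dict]:
    """One possible dual-filtering rule: keep only informative rollout
    groups, dropping those whose rewards are all near-perfect (nothing
    left to learn) or all near-zero (likely noise)."""
    if not samples:
        return samples
    rewards = [s["reward"] for s in samples]
    if min(rewards) > high or max(rewards) < low:
        return []
    return samples
```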