🤖 AI Summary
To address critical challenges in real-world mobile GUI environments, including perceptual ambiguity, inaccurate element localization, and weak reasoning, this paper proposes a general-purpose mobile GUI agent framework. Methodologically, it introduces a fine-grained vision–semantics multimodal alignment mechanism and a unified discrete action space, underpinned by a meta-plan reasoning architecture; it incorporates a spatially enhanced composite reward function and a dual-filter reinforcement learning fine-tuning strategy; and it establishes an automated data pipeline for crawling, annotation, and continual pretraining. Evaluated on the proprietary Magic-RICH benchmark and more than a dozen public GUI benchmarks, the proposed method achieves state-of-the-art performance across the board, improves generalization and robustness on complex, dynamic interfaces, and enhances practical deployability on mobile devices. This work establishes a scalable, end-to-end technical paradigm for embodied intelligent agents in mobile GUI settings.
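The unified discrete action space is only named above, not specified. As a rough illustration of what such a schema might look like, the Python sketch below defines a hypothetical action enum spanning both basic UI operations and higher-level interactive intents; all type names and fields are assumptions for illustration, not MagicGUI's actual interface.

```python
# Hypothetical sketch of a unified discrete action space for a mobile GUI
# agent. Action names and fields are illustrative assumptions, not the
# schema actually used by MagicGUI.
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class ActionType(Enum):
    # Fundamental UI operations
    TAP = "tap"
    LONG_PRESS = "long_press"
    SWIPE = "swipe"
    TYPE_TEXT = "type_text"
    PRESS_BACK = "press_back"
    PRESS_HOME = "press_home"
    # Complex interactive intents supporting human-agent interaction
    ASK_USER = "ask_user"            # request clarification from the user
    TASK_COMPLETE = "task_complete"
    TASK_INFEASIBLE = "task_infeasible"


@dataclass
class Action:
    """One agent step: a discrete action type plus optional spatial
    and textual arguments."""
    type: ActionType
    point: Optional[Tuple[int, int]] = None   # screen coords for TAP / LONG_PRESS
    direction: Optional[str] = None           # "up" / "down" / ... for SWIPE
    text: Optional[str] = None                # payload for TYPE_TEXT / ASK_USER


# Example: two decoded steps for "tap the search box, then type a query"
step1 = Action(ActionType.TAP, point=(540, 180))
step2 = Action(ActionType.TYPE_TEXT, text="weather tomorrow")
```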
📝 Abstract
This paper presents MagicGUI, a foundational mobile GUI agent designed to address critical challenges in perception, grounding, and reasoning within real-world mobile GUI environments. The framework is underpinned by the following six key components: (1) a comprehensive and accurate dataset, constructed via the scalable GUI Data Pipeline, which aggregates the largest and most diverse GUI-centric multimodal data to date from open-source repositories, automated crawling, and targeted manual annotation; (2) enhanced perception and grounding capabilities, facilitating fine-grained multimodal alignment for UI element referencing, grounding, and screen comprehension; (3) a comprehensive and unified action space, encompassing both fundamental UI operations and complex interactive intents to support human-agent interaction; (4) planning-oriented reasoning mechanisms that enable the model to decompose complex user instructions into sequential actions with explicit intermediate meta-plan reasoning; (5) an iterative two-stage training procedure, combining large-scale continued pre-training on 7.8M samples with reinforcement fine-tuning that uses a spatially enhanced composite reward and a dual filtering strategy; and (6) competitive performance on both the proprietary Magic-RICH benchmark and over a dozen public benchmarks, with superior results across GUI perception and agent tasks and robust generalization and real-world deployment potential in practical mobile GUI scenarios, as detailed in Figure 1.
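The abstract names a spatially enhanced composite reward and a dual filtering strategy without giving formulas. The minimal Python sketch below shows one plausible reading: a weighted sum of format, action-type, and distance-decaying spatial terms, plus a filter that discards rollout groups whose rewards are uniformly high or uniformly low. The weights, decay rate, and thresholds are invented for illustration and are not the paper's actual formulation.

```python
# Hypothetical sketch of a spatially enhanced composite reward and a dual
# filter for reinforcement fine-tuning. All constants are assumptions.
import math
from typing import List, Optional, Tuple


def composite_reward(
    pred_type: str,
    gold_type: str,
    pred_point: Optional[Tuple[float, float]],
    gold_point: Optional[Tuple[float, float]],
    format_ok: bool,
) -> float:
    """Combine a format check, an action-type match, and a spatial term
    that decays smoothly with distance from the gold coordinate
    (coordinates assumed normalized to [0, 1])."""
    r_format = 1.0 if format_ok else 0.0
    r_type = 1.0 if pred_type == gold_type else 0.0
    if pred_point is not None and gold_point is not None:
        dist = math.dist(pred_point, gold_point)
        r_space = math.exp(-5.0 * dist)  # assumed decay rate
    else:
        r_space = r_type  # actions without coordinates fall back to type match
    return 0.2 * r_format + 0.4 * r_type + 0.4 * r_space


def dual_filter(samples: List[dict], low: float = 0.1, high: float = 0.9) -> List[dict]:
    """One possible dual-filtering rule: keep only informative rollout
    groups, dropping those whose rewards are all near-perfect (nothing
    left to learn) or all near-zero (likely noise)."""
    if not samples:
        return samples
    rewards = [s["reward"] for s in samples]
    if min(rewards) > high or max(rewards) < low:
        return []
    return samples
```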