🤖 AI Summary
GUI-driven mobile agents face three core challenges in real-world deployment: low task-execution accuracy, inefficient reasoning, and a scarcity of high-quality annotated data. To address these issues, the paper proposes MobiAgent, a holistic system framework comprising: (1) the MobiMind series of mobile-optimized vision-language agent models; (2) AgentRR, a GUI-structure-aware framework for accelerating agent reasoning; (3) an AI-assisted data collection pipeline supporting both self-labeling and synthetic data generation; and (4) MobiFlow, a lightweight, multi-task benchmarking suite. Extensive experiments show that MobiAgent significantly outperforms both general-purpose large language models and state-of-the-art GUI agents on real-device tasks, achieving state-of-the-art accuracy and inference speed while reducing human annotation cost by 67%.
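As a rough illustration of how these pieces could fit together, the sketch below models a single agent step in which a cached experience store short-circuits full model inference. Neither the summary nor the abstract specifies AgentRR's internals, so the record-and-replay cache here is only one plausible acceleration mechanism, and every name in the sketch (`Action`, `MobiMindModel`, `AgentRRCache`, `run_step`) is a hypothetical stand-in rather than the paper's actual API.

```python
from dataclasses import dataclass

# All class and method names below are illustrative inventions; the paper's
# real interfaces are not described in the summary or abstract.

@dataclass
class Action:
    kind: str        # e.g. "tap", "type", "scroll"
    target: str      # identifier of the GUI element to act on
    text: str = ""   # payload for "type" actions

class MobiMindModel:
    """Stand-in for a MobiMind-series VLM mapping (task, screen) -> action."""
    def propose(self, task: str, screen: str) -> Action:
        # A real model would run VLM inference on a screenshot here.
        return Action(kind="tap", target=f"element_for:{task}")

class AgentRRCache:
    """Stand-in for an acceleration layer that replays previously recorded
    actions, so the expensive model call is skipped on repeated states."""
    def __init__(self) -> None:
        self._traces: dict[tuple[str, str], Action] = {}

    def lookup(self, task: str, screen: str) -> Action | None:
        return self._traces.get((task, screen))

    def record(self, task: str, screen: str, action: Action) -> None:
        self._traces[(task, screen)] = action

def run_step(task: str, screen: str,
             model: MobiMindModel, cache: AgentRRCache) -> Action:
    """One step of the hypothesized accelerated agent loop."""
    cached = cache.lookup(task, screen)
    if cached is not None:
        return cached                     # replay: no model inference needed
    action = model.propose(task, screen)  # fall back to full reasoning
    cache.record(task, screen, action)    # record for future replay
    return action

if __name__ == "__main__":
    model, cache = MobiMindModel(), AgentRRCache()
    print(run_step("open settings", "home_screen", model, cache))  # model call
    print(run_step("open settings", "home_screen", model, cache))  # cache hit
```

In a real system the cache key would presumably be derived from GUI structure rather than a raw screen string, consistent with the summary's description of AgentRR as GUI-structure-aware.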
📝 Abstract
With the rapid advancement of Vision-Language Models (VLMs), GUI-based mobile agents have emerged as a key development direction for intelligent mobile systems. However, existing agent models continue to face significant challenges in real-world task execution, particularly in terms of accuracy and efficiency. To address these limitations, we propose MobiAgent, a comprehensive mobile agent system comprising three core components: the MobiMind-series agent models, the AgentRR acceleration framework, and the MobiFlow benchmarking suite. Furthermore, recognizing that the capabilities of current mobile agents are still limited by the availability of high-quality data, we have developed an AI-assisted agile data collection pipeline that significantly reduces the cost of manual annotation. Compared to both general-purpose LLMs and specialized GUI agent models, MobiAgent achieves state-of-the-art performance in real-world mobile scenarios.