MobileUse: A GUI Agent with Hierarchical Reflection for Autonomous Mobile Operation

📅 2025-07-21
🤖 AI Summary
To address challenges in mobile GUI agents, including long-horizon task execution, dynamic environment adaptation, and cold-start performance in unfamiliar scenarios, this paper proposes MobileUse, a GUI agent with a hierarchical reflection architecture built on multimodal large language models (MLLMs). The method combines action-level and task-level state assessment, uses an on-demand reflection mechanism to limit computational cost, and adds a proactive exploration module to mitigate cold-start issues. Together these let the agent monitor itself, detect errors, and recover from them across temporal scales. On the AndroidWorld and AndroidLab benchmarks, MobileUse achieves task success rates of 62.9% and 44.2%, respectively, which the paper reports as new state-of-the-art results. The authors also release an out-of-the-box toolkit for automated task execution on physical mobile devices.

📝 Abstract
Recent advances in Multimodal Large Language Models (MLLMs) have enabled the development of mobile agents that can understand visual inputs and follow user instructions, unlocking new possibilities for automating complex tasks on mobile devices. However, applying these models to real-world mobile scenarios remains a significant challenge due to long-horizon task execution, difficulty in error recovery, and the cold-start problem in unfamiliar environments. To address these challenges, we propose MobileUse, a GUI agent designed for robust and adaptive mobile task execution. To improve resilience in long-horizon tasks and dynamic environments, we introduce a hierarchical reflection architecture that enables the agent to self-monitor, detect, and recover from errors across multiple temporal scales (ranging from individual actions to overall task completion) while maintaining efficiency through a reflection-on-demand strategy. To tackle cold-start issues, we further introduce a proactive exploration module, which enriches the agent's understanding of the environment through self-planned exploration. Evaluations on the AndroidWorld and AndroidLab benchmarks demonstrate that MobileUse establishes new state-of-the-art performance, achieving success rates of 62.9% and 44.2%, respectively. To facilitate real-world applications, we release an out-of-the-box toolkit for automated task execution on physical mobile devices, which is available at https://github.com/MadeAgents/mobile-use.
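The abstract describes reflection at two temporal scales with a reflection-on-demand strategy, but gives no implementation detail. The following is a minimal toy sketch of that idea under stated assumptions, not the MobileUse implementation: the names `Trace`, `run_with_reflection`, `execute`, and `task_done`, the use of a confidence threshold as the on-demand trigger, and the retry-based recovery are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Trace:
    """Bookkeeping for one run (hypothetical structure)."""
    completed: List[str] = field(default_factory=list)
    action_reflections: int = 0
    task_reflections: int = 0
    success: bool = False


def run_with_reflection(
    plan: List[str],
    execute: Callable[[str, int], Tuple[bool, float]],
    task_done: Callable[[List[str]], bool],
    threshold: float = 0.8,
    max_retries: int = 2,
) -> Trace:
    """Run each action; reflect on an action only when confidence is low."""
    trace = Trace()
    for action in plan:
        for attempt in range(max_retries + 1):
            ok, confidence = execute(action, attempt)
            if confidence >= threshold:
                # On-demand: a confident step skips action-level reflection.
                # (Assumption: in this toy, high confidence implies success.)
                trace.completed.append(action)
                break
            # Low confidence: action-level reflection inspects the outcome.
            trace.action_reflections += 1
            if ok:
                trace.completed.append(action)
                break
            # Otherwise the step failed; loop again to retry (recovery).
    # Task-level reflection: one check of overall completion at the end.
    trace.task_reflections += 1
    trace.success = task_done(trace.completed)
    return trace


def toy_execute(action: str, attempt: int) -> Tuple[bool, float]:
    """Toy environment: 'tap_search' fails once with low confidence."""
    if action == "tap_search" and attempt == 0:
        return False, 0.4
    return True, 0.95


plan = ["open_app", "tap_search", "type_query"]
trace = run_with_reflection(plan, toy_execute, lambda done: set(plan) <= set(done))
print(trace.action_reflections, trace.success)  # prints: 1 True
```

In this toy run only the one uncertain step triggers action-level reflection, which is the efficiency argument behind reflecting on demand; a real agent would replace `toy_execute` and `task_done` with MLLM-based judgments over screenshots.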
Problem

Research questions and friction points this paper is trying to address.

Addresses long-horizon mobile task execution challenges
Improves error recovery in dynamic mobile environments
Solves cold-start issues in unfamiliar mobile interfaces
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical reflection for error recovery
Proactive exploration for cold-start issues
Out-of-the-box toolkit for mobile automation
Authors
Ning Li, Shanghai Jiao Tong University
Xiangmou Qu, OPPO Research Institute. Interests: machine learning, distributed systems
Jiamu Zhou, OPPO Research Institute
Jun Wang, OPPO Research Institute
Muning Wen, Research Assistant Professor, Shanghai Jiao Tong University. Interests: (multi-agent) reinforcement learning, language agents/LLM-based agents
Kounianhua Du, Shanghai Jiao Tong University. Interests: data science, large language models
Xingyu Lou, OPPO Research Institute
Qiuying Peng, OPPO Research Institute. Interests: artificial intelligence
Jun Wang, OPPO Research Institute
Weinan Zhang, Shanghai Jiao Tong University