Fairy: Interactive Mobile Assistant to Real-world Tasks via LMM-based Multi-agent

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing mobile GUI agents generalize poorly in real-world scenarios, struggling with long-tail applications and changing user requirements: end-to-end approaches rely heavily on the model's commonsense knowledge and lack robustness, while non-interactive agents do not involve the user, which degrades user experience. This paper proposes Fairy, an interactive multi-agent architecture that integrates global task planning, application-level execution (with short- and long-term memory), and a self-learning module. Using a dual-loop coordination mechanism and a structured App Map knowledge base, the system supports cross-application orchestration, real-time user participation, and automatic conversion of execution experience into reusable knowledge. Its core innovation is unifying user interaction, multi-agent collaboration, and continual learning within a single framework so that the system can evolve during use. Evaluated on the RealMobile-Eval benchmark with a GPT-4o backbone, Fairy improves user requirement completion by 33.7% and reduces redundant steps by 58.5% relative to the previous SoTA.

📝 Abstract
Large multi-modal models (LMMs) have advanced mobile GUI agents. However, existing methods struggle with real-world scenarios involving diverse app interfaces and evolving user needs. End-to-end methods that rely on the model's commonsense often fail on long-tail apps, and agents without user interaction act unilaterally, harming user experience. To address these limitations, we propose Fairy, an interactive multi-agent mobile assistant capable of continuously accumulating app knowledge and self-evolving during usage. Fairy enables cross-app collaboration, interactive execution, and continual learning through three core modules: (i) a Global Task Planner that decomposes user tasks into sub-tasks from a cross-app view; (ii) an App-Level Executor that refines sub-tasks into steps and actions based on long- and short-term memory, achieving precise execution and user interaction via four core agents operating in dual loops; and (iii) a Self-Learner that consolidates execution experience into App Map and Tricks. To evaluate Fairy, we introduce RealMobile-Eval, a real-world benchmark with a comprehensive metric suite and LMM-based agents for automated scoring. Experiments show that Fairy with a GPT-4o backbone outperforms the previous SoTA, improving user requirement completion by 33.7% and reducing redundant steps by 58.5%, which demonstrates the effectiveness of its interaction and self-learning.
Problem

Research questions and friction points this paper is trying to address.

Addressing limitations of mobile GUI agents in real-world scenarios
Enabling cross-app collaboration and interactive task execution
Facilitating continual learning through self-evolving multi-agent architecture
Innovation

Methods, ideas, or system contributions that make the work stand out.

Global Task Planner decomposes tasks across apps
App-Level Executor refines tasks with memory and interaction
Self-Learner consolidates experience into App Map and Tricks
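
The three modules listed above can be pictured as a plan, execute, learn loop. The following is a minimal sketch of that flow, not Fairy's actual implementation: every name in it (`SubTask`, `AppKnowledge`, `plan`, `execute`, `learn`, `run_task`) is hypothetical, the planner parses a pre-formatted string instead of calling an LMM, and the four executor agents, the dual loops, and Tricks are not modeled.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SubTask:
    app: str
    goal: str


@dataclass
class AppKnowledge:
    # Hypothetical long-term memory for one app, loosely mirroring the "App Map".
    app_map: dict[str, str] = field(default_factory=dict)  # goal -> learned action


def plan(task: str) -> list[SubTask]:
    # Stand-in for the Global Task Planner. A real planner would query an LMM;
    # this toy version assumes the task is written as "App: goal; App: goal".
    subtasks = []
    for part in task.split(";"):
        app, goal = part.split(":", 1)
        subtasks.append(SubTask(app.strip(), goal.strip()))
    return subtasks


def execute(sub: SubTask, memory: AppKnowledge,
            ask_user: Callable[[str], str]) -> list[str]:
    # Stand-in for the App-Level Executor: returns the trace of actions taken.
    trace = [f"open {sub.app}"]
    if sub.goal in memory.app_map:
        # Long-term memory hit: reuse the action learned on an earlier run.
        trace.append(memory.app_map[sub.goal])
    else:
        # No knowledge yet: involve the user instead of guessing.
        hint = ask_user(f"How should I '{sub.goal}' in {sub.app}?")
        trace.append(f"{sub.goal} via {hint}")
    return trace


def learn(sub: SubTask, trace: list[str], memory: AppKnowledge) -> None:
    # Stand-in for the Self-Learner: keep the final action for later reuse.
    memory.app_map[sub.goal] = trace[-1]


def run_task(task: str, knowledge: dict[str, AppKnowledge],
             ask_user: Callable[[str], str]) -> list[list[str]]:
    traces = []
    for sub in plan(task):
        memory = knowledge.setdefault(sub.app, AppKnowledge())
        trace = execute(sub, memory, ask_user)
        learn(sub, trace, memory)
        traces.append(trace)
    return traces
```

In this toy version, running the same sub-task a second time skips the user question because the first run stored the action in `app_map`. It is only an illustration of how accumulated experience could remove repeated work, and it says nothing about how Fairy obtains the gains it reports.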
Authors
Jiazheng Sun
Fudan University, College of Computer Science and Artificial Intelligence
Te Yang
Institute of Automation, Chinese Academy of Sciences
Multimodal Large Language Models
Jiayang Niu
Fudan University, College of Computer Science and Artificial Intelligence
Mingxuan Li
Fudan University, College of Computer Science and Artificial Intelligence
Yongyong Lu
Fudan University, College of Computer Science and Artificial Intelligence
Ruimeng Yang
Fudan University, College of Computer Science and Artificial Intelligence
Xin Peng
East China University of Science and Technology
Artificial Intelligence, Machine Learning, Complex Process Modeling