Fairy: Interactive Mobile Assistant to Real-world Tasks via LMM-based Multi-agent

📅 2025-09-25

🤖 AI Summary

Existing mobile GUI agents generalize poorly in real-world scenarios and struggle with long-tail applications and changing user requirements. End-to-end approaches rely heavily on the model's commonsense knowledge and lack robustness, while non-interactive agents leave the user out of the loop, which degrades user experience. This paper proposes Fairy, an interactive multi-agent architecture that integrates global task planning, application-level execution (with short- and long-term memory), and a self-learning module. Using a dual-loop coordination mechanism and a structured App Map knowledge base, the system supports cross-application orchestration, real-time user participation, and automatic conversion of operational experience into reusable knowledge. Its core contribution is unifying user interaction, multi-agent collaboration, and continual learning in a single framework so that the system improves with use. Evaluated on the RealMobile-Eval benchmark with a GPT-4o backbone, Fairy improves user requirement completion by 33.7% and reduces redundant steps by 58.5% compared with the previous SoTA.

📝 Abstract

Large multi-modal models (LMMs) have advanced mobile GUI agents. However, existing methods struggle with real-world scenarios involving diverse app interfaces and evolving user needs. End-to-end methods that rely on the model's commonsense often fail on long-tail apps, and agents without user interaction act unilaterally, harming user experience. To address these limitations, we propose Fairy, an interactive multi-agent mobile assistant capable of continuously accumulating app knowledge and self-evolving during usage. Fairy enables cross-app collaboration, interactive execution, and continual learning through three core modules: (i) a Global Task Planner that decomposes user tasks into sub-tasks from a cross-app view; (ii) an App-Level Executor that refines sub-tasks into steps and actions based on long- and short-term memory, achieving precise execution and user interaction via four core agents operating in dual loops; and (iii) a Self-Learner that consolidates execution experience into App Map and Tricks. To evaluate Fairy, we introduce RealMobile-Eval, a real-world benchmark with a comprehensive metric suite, and LMM-based agents for automated scoring. Experiments show that Fairy with a GPT-4o backbone outperforms the previous SoTA, improving user requirement completion by 33.7% and reducing redundant steps by 58.5%, which demonstrates the effectiveness of its interaction and self-learning.

Problem

Research questions and friction points this paper is trying to address.

Addressing limitations of mobile GUI agents in real-world scenarios

Enabling cross-app collaboration and interactive task execution

Facilitating continual learning through self-evolving multi-agent architecture

Innovation

Methods, ideas, or system contributions that make the work stand out.

Global Task Planner decomposes tasks across apps

App-Level Executor refines tasks with memory and interaction

Self-Learner consolidates experience into App Map and Tricks
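
The three modules listed above can be pictured as a plan, execute, learn pipeline. The following is only a toy sketch under stated assumptions, not the paper's implementation: every name (`SubTask`, `Knowledge`, `global_task_planner`, `app_level_executor`, `self_learner`, `run`) is hypothetical, the LMM calls are replaced by simple stubs, and the "App Map" and "Tricks" stores are reduced to plain dictionaries.

```python
from dataclasses import dataclass, field


@dataclass
class SubTask:
    app: str   # app that should handle this sub-task
    goal: str  # what to accomplish inside that app


@dataclass
class Knowledge:
    # Hypothetical long-term memory; stands in for "App Map" and "Tricks".
    app_map: dict = field(default_factory=dict)  # app -> steps observed
    tricks: dict = field(default_factory=dict)   # app -> goals done before


def global_task_planner(request):
    """Stub planner: a real one would ask an LMM to decompose the request."""
    return [SubTask(app, goal) for app, goal in request]


def app_level_executor(sub_task, knowledge):
    """Stub executor: returns the steps taken for one sub-task."""
    short_term = []  # short-term memory: steps taken within this sub-task
    known = sub_task.goal in knowledge.tricks.get(sub_task.app, [])
    if not known:
        # Without prior knowledge, explore first (a redundant step).
        short_term.append(f"explore {sub_task.app}")
    short_term.append(f"do {sub_task.goal}")
    return short_term


def self_learner(sub_task, steps, knowledge):
    """Stub learner: stores experience so later runs can skip exploration."""
    knowledge.app_map.setdefault(sub_task.app, []).extend(steps)
    tricks = knowledge.tricks.setdefault(sub_task.app, [])
    if sub_task.goal not in tricks:
        tricks.append(sub_task.goal)


def run(request, knowledge):
    """Plan across apps, execute each sub-task, then consolidate experience."""
    trace = []
    for sub_task in global_task_planner(request):
        steps = app_level_executor(sub_task, knowledge)
        self_learner(sub_task, steps, knowledge)
        trace.extend(steps)
    return trace
```

Running the same request twice against one `Knowledge` object yields a shorter second trace, because the stub executor skips exploration once the learner has recorded the goal. This only illustrates the idea of reusing accumulated experience to avoid redundant steps; it says nothing about how Fairy achieves its reported numbers.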
