Mirage-1: Augmenting and Updating GUI Agent with Hierarchical Multimodal Skills

📅 2025-06-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing GUI agents suffer from knowledge limitations and an offline-online domain gap, which hinder performance on long-horizon online tasks. To address this, we propose Mirage-1, a cross-platform GUI agent built around Hierarchical Multimodal Skills (HMS), a novel skill representation that progressively abstracts trajectories into three levels: execution skills, core skills, and meta-skills. We further design Skill-Augmented Monte Carlo Tree Search (SA-MCTS), which uses skills acquired offline to prune the online action search space, enabling dynamic planning and plug-and-play skill updates. To evaluate real-world long-horizon performance, we also construct a new benchmark, AndroidLH. Experiments show that Mirage-1 improves task completion rates over previous agents by 32% on AndroidWorld, 19% on MobileMiniWob++, 15% on Mind2Web-Live, and 79% on AndroidLH, significantly outperforming state-of-the-art methods.

📝 Abstract
Recent efforts to leverage Multimodal Large Language Models (MLLMs) as GUI agents have yielded promising outcomes. However, these agents still struggle with long-horizon tasks in online environments, primarily due to insufficient knowledge and the inherent gap between offline and online domains. In this paper, inspired by how humans generalize knowledge in open-ended environments, we propose a Hierarchical Multimodal Skills (HMS) module to tackle the issue of insufficient knowledge. It progressively abstracts trajectories into execution skills, core skills, and ultimately meta-skills, providing a hierarchical knowledge structure for long-horizon task planning. To bridge the domain gap, we propose the Skill-Augmented Monte Carlo Tree Search (SA-MCTS) algorithm, which efficiently leverages skills acquired in offline environments to reduce the action search space during online tree exploration. Building on HMS, we propose Mirage-1, a multimodal, cross-platform, plug-and-play GUI agent. To validate the performance of Mirage-1 in real-world long-horizon scenarios, we constructed a new benchmark, AndroidLH. Experimental results show that Mirage-1 outperforms previous agents by 32%, 19%, 15%, and 79% on AndroidWorld, MobileMiniWob++, Mind2Web-Live, and AndroidLH, respectively. Project page: https://cybertronagent.github.io/Mirage-1.github.io/
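The abstract describes HMS as a three-level abstraction from raw trajectories up to meta-skills. A minimal sketch of how such a hierarchy might be represented is shown below; all class names, fields, and the example skill are illustrative assumptions, not the paper's actual data structures.

```python
from dataclasses import dataclass

# Illustrative sketch of a three-level skill hierarchy (HMS-style).
# Names and fields are assumptions for illustration only.

@dataclass
class ExecutionSkill:
    """Lowest level: a concrete GUI action sequence abstracted from a trajectory."""
    name: str
    actions: list  # e.g. [("tap", "search_box"), ("type", "<query>")]

@dataclass
class CoreSkill:
    """Middle level: groups related execution skills serving one sub-goal."""
    name: str
    execution_skills: list  # list of ExecutionSkill

@dataclass
class MetaSkill:
    """Top level: a reusable high-level capability composed of core skills."""
    name: str
    core_skills: list  # list of CoreSkill

# Example: abstracting a hypothetical "search" capability bottom-up.
tap_and_type = ExecutionSkill(
    "open_search_and_type",
    [("tap", "search_box"), ("type", "<query>")],
)
search_core = CoreSkill("search_in_app", [tap_and_type])
search_meta = MetaSkill("information_retrieval", [search_core])
```

A planner working at the meta-skill level can then drill down to concrete action sequences only when grounding a plan step in the current GUI state.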
Problem

Research questions and friction points this paper is trying to address.

Addresses insufficient knowledge in GUI agents for long-horizon tasks
Bridges offline-online domain gap for multimodal GUI agents
Enhances task planning with hierarchical multimodal skills
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Multimodal Skills for knowledge abstraction
Skill-Augmented Monte Carlo Tree Search algorithm
Mirage-1 multimodal cross-platform GUI agent
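The core idea behind SA-MCTS, as described in the abstract, is to use offline-acquired skills to shrink the action search space during online tree exploration. The toy sketch below shows a generic UCT-style search where node expansion is restricted to skill-recommended actions; the environment, function names, and parameters are all assumptions for illustration, not the paper's implementation.

```python
import math
import random

# Toy sketch: MCTS whose expansion step only considers actions recommended
# by an offline-learned skill library, reducing the branching factor.

class Node:
    def __init__(self, state, parent=None, action=None):
        self.state, self.parent, self.action = state, parent, action
        self.children, self.visits, self.value = [], 0, 0.0

def ucb(node, c=1.4):
    """Standard UCB1 score; unvisited nodes are explored first."""
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def skill_filtered_actions(state, all_actions, skill_library):
    """Keep only actions some skill recommends for this state (fall back to all)."""
    recommended = {a for skill in skill_library for a in skill(state)}
    return [a for a in all_actions if a in recommended] or list(all_actions)

def sa_mcts(root_state, step, reward, all_actions, skill_library,
            iters=50, depth=5, seed=0):
    rng = random.Random(seed)
    root = Node(root_state)
    for _ in range(iters):
        # Selection: descend by UCB until a leaf.
        node = root
        while node.children:
            node = max(node.children, key=ucb)
        # Expansion: restricted to skill-recommended actions.
        for a in skill_filtered_actions(node.state, all_actions, skill_library):
            node.children.append(Node(step(node.state, a), node, a))
        leaf = rng.choice(node.children)
        # Rollout: random playout to estimate value.
        state, total = leaf.state, 0.0
        for _ in range(depth):
            state = step(state, rng.choice(all_actions))
            total += reward(state)
        # Backpropagation.
        while leaf:
            leaf.visits += 1
            leaf.value += total
            leaf = leaf.parent
    return max(root.children, key=lambda n: n.visits).action
```

For example, with integer states, `step = lambda s, a: s + a`, `reward = lambda s: s`, and a skill library that recommends only the action `2`, the search tree branches on a single action per node instead of the full action set, which is the intended effect of skill augmentation.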