MobileDreamer: Generative Sketch World Model for GUI Agent

📅 2026-01-07
🏛️ arXiv.org
📈 Citations: 2
✨ Influential: 0
🤖 AI Summary
This work addresses the limitations of existing mobile GUI agents, which predominantly rely on reactive decision-making and struggle with long-horizon tasks. To overcome this, we propose a world model framework grounded in textual sketches that predicts post-action GUI states by generating task-relevant textual descriptions. The framework incorporates an imagination-based planning mechanism to refine action selection and introduces a permutation-invariant learning strategy that preserves spatial awareness while enabling efficient state prediction. Evaluated on the Android World benchmark, our method achieves state-of-the-art performance, improving task success rate by 5.25% and accurately forecasting key GUI elements, thereby significantly enhancing the agent’s capacity for foresighted planning.
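The permutation-invariant idea mentioned above can be made concrete with a toy example: if the predicted GUI state is a *set* of elements with coordinates, comparing predictions to targets should not depend on the order in which elements are listed. The sketch below is an illustrative assumption, not the paper's actual learning objective; the element format and matching rule are invented for demonstration.

```python
# Hedged sketch: order-invariant comparison of GUI element sets that still
# respects spatial information. The element schema here is hypothetical.

def canonical(elements):
    """Sort elements by (y, x, label) so every permutation maps to one order."""
    return sorted(elements, key=lambda e: (e["y"], e["x"], e["label"]))

def sets_match(pred, target):
    """Order-invariant equality check that keeps coordinates significant."""
    return canonical(pred) == canonical(target)

# Same elements, listed in different orders: should still match.
pred = [
    {"label": "Save", "x": 200, "y": 50},
    {"label": "Back", "x": 10, "y": 50},
]
target = [
    {"label": "Back", "x": 10, "y": 50},
    {"label": "Save", "x": 200, "y": 50},
]
print(sets_match(pred, target))  # True
```

A real permutation-invariant training loss would use a differentiable set-matching formulation rather than exact equality, but the invariance property being targeted is the same.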

πŸ“ Abstract
Mobile GUI agents have shown strong potential in real-world automation and practical applications. However, most existing agents remain reactive, making decisions mainly from the current screen, which limits their performance on long-horizon tasks. Building a world model from repeated interactions enables forecasting action outcomes and supports better decision-making for mobile GUI agents. This is challenging because the model must predict post-action states with spatial awareness while remaining efficient enough for practical deployment. In this paper, we propose MobileDreamer, an efficient world-model-based lookahead framework that equips GUI agents with the future imagination provided by the world model. It consists of a textual sketch world model and a rollout imagination strategy for the GUI agent. The textual sketch world model forecasts post-action states by learning to transform screen images into key task-related sketches, and introduces a novel order-invariant learning strategy to preserve the spatial information of GUI elements. The rollout imagination strategy optimizes action selection by leveraging the prediction capability of the world model. Experiments on Android World show that MobileDreamer achieves state-of-the-art performance and improves task success rate by 5.25%. World model evaluations further verify that our textual sketch modeling accurately forecasts key GUI elements.
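As a rough illustration of the rollout-imagination idea in the abstract, the following sketch scores each candidate action by "imagining" its post-action state with a world model and keeping the best-scoring action. Every name here (`predict_sketch`, `score`, the toy model and scorer) is a hypothetical placeholder, not the paper's actual API or implementation.

```python
# Hedged sketch of world-model-based lookahead action selection.
# Interfaces are illustrative assumptions, not MobileDreamer's real code.

def lookahead_select(state, candidate_actions, predict_sketch, score):
    """Pick the action whose imagined post-action state scores highest.

    predict_sketch(state, action) -> textual sketch of the predicted GUI state
    score(sketch) -> estimated task progress for that imagined state
    """
    best_action, best_score = None, float("-inf")
    for action in candidate_actions:
        imagined = predict_sketch(state, action)  # one-step "imagination"
        s = score(imagined)
        if s > best_score:
            best_action, best_score = action, s
    return best_action

# Toy usage: a fake world model that appends the action name to the state,
# and a scorer that prefers reaching a settings screen.
fake_model = lambda state, a: f"{state} -> after {a}"
scorer = lambda sketch: 1.0 if "open_settings" in sketch else 0.0
chosen = lookahead_select(
    "home screen",
    ["tap_search", "open_settings", "scroll_down"],
    fake_model,
    scorer,
)
print(chosen)  # open_settings
```

In the paper's setting the one-step prediction would come from the learned textual sketch world model and the rollout may extend over multiple imagined steps; this sketch only shows the single-step selection loop.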
Problem

Research questions and friction points this paper addresses.

mobile GUI agent · world model · long-horizon tasks · spatial awareness · action outcome prediction
Innovation

Methods, ideas, or system contributions that make the work stand out.

world model · textual sketch · order-invariant learning · rollout imagination · GUI agent
👥 Authors
Yilin Cao: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Yufeng Zhong: Meituan (Multimodal LLM; Computer Vision)
Zhixiong Zeng: Meituan
Liming Zheng: Meituan
Jing Huang: Meituan
Haibo Qiu: University of Sydney (Multimodal LLM; Vision and Language; Computer Vision)
Peng Shi: Meituan
Wenji Mao: Professor at Institute of Automation, Chinese Academy of Sciences (Artificial Intelligence; Intelligent Agents; Social Modeling and Computing)
Guanglu Wan: Meituan