🤖 AI Summary
Existing pixel-based world models struggle to capture complex visual state transitions in mobile GUI environments. To address this, we propose a semantic world modeling paradigm that represents GUI states and their transitions with natural language descriptions instead of raw pixels. Methodologically, we introduce the first semantic modeling framework for mobile GUI agents, comprising (1) semantic state prediction driven by a vision-language model (VLM), (2) alignment modeling of GUI action–feedback pairs, and (3) a collaborative integration mechanism between the VLM and the planning module. We further release MobileWorldBench, the first dedicated evaluation benchmark for mobile GUI world modeling, and MobileWorld, a large-scale dataset containing 1.4 million samples. Experiments show improved task success rates for mobile agents, supporting the effectiveness and practicality of semantic world models in real-world GUI environments.
📝 Abstract
World models have shown great utility in improving the task performance of embodied agents. While prior work largely focuses on pixel-space world models, these approaches face practical limitations in GUI settings, where predicting complex visual elements in future states is often difficult. In this work, we explore an alternative formulation of world modeling for GUI agents, in which state transitions are described in natural language rather than predicted as raw pixels. First, we introduce MobileWorldBench, a benchmark that evaluates the ability of vision-language models (VLMs) to function as world models for mobile GUI agents. Second, we release MobileWorld, a large-scale dataset of 1.4M samples that significantly improves the world modeling capabilities of VLMs. Finally, we propose a novel framework that integrates VLM world models into the planning framework of mobile agents, demonstrating that semantic world models can directly benefit mobile agents by improving task success rates. The code and dataset are available at https://github.com/jacklishufan/MobileWorld
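To make the idea of a semantic world model inside a planning loop more concrete, the following is a minimal sketch and not the authors' implementation. It assumes one plausible integration: for each candidate action, a world model returns a natural-language description of the predicted next state, and the agent picks the action whose prediction best matches the goal. Every name here (`Candidate`, `toy_world_model`, `toy_score`, `select_action`) is hypothetical, the world model is a hard-coded stand-in for a VLM, and the word-overlap score is a placeholder for whatever scoring a real system would use.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    action: str            # hypothetical action string, e.g. "tap('Settings')"
    predicted_change: str  # natural-language description of the predicted next state


def toy_world_model(state: str, action: str) -> str:
    """Stand-in for a VLM world model (assumption): maps (state, action)
    to a text description of the resulting state transition."""
    rules = {
        "tap('Settings')": "The Settings screen opens, showing Wi-Fi and Bluetooth options.",
        "tap('Camera')": "The camera viewfinder opens.",
        "press_back()": "The app closes and the home screen is shown.",
    }
    return rules.get(action, "Nothing visible changes.")


def toy_score(goal: str, predicted_change: str) -> float:
    """Placeholder relevance score (assumption): fraction of goal words
    that also appear in the predicted description."""
    goal_words = set(goal.lower().split())
    cleaned = predicted_change.lower().replace(".", "").replace(",", "")
    pred_words = set(cleaned.split())
    return len(goal_words & pred_words) / max(len(goal_words), 1)


def select_action(
    state: str,
    goal: str,
    actions: List[str],
    world_model: Callable[[str, str], str],
    score: Callable[[str, str], float],
) -> Candidate:
    """Predict the textual outcome of each action, then keep the best-scoring one."""
    candidates = [Candidate(a, world_model(state, a)) for a in actions]
    return max(candidates, key=lambda c: score(goal, c.predicted_change))


if __name__ == "__main__":
    best = select_action(
        state="Home screen with app icons",
        goal="turn on wi-fi in settings",
        actions=["tap('Settings')", "tap('Camera')", "press_back()"],
        world_model=toy_world_model,
        score=toy_score,
    )
    print(best.action, "->", best.predicted_change)
```

In this toy setup the predicted transitions are plain strings, so the planner can compare them against the goal without rendering or inspecting any pixels, which is the property the semantic formulation relies on.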