🤖 AI Summary
Existing mobile agent benchmarks such as AndroidWorld are saturated (recent agents exceed 90% success rates) and offer limited app coverage (e.g., no e-commerce or enterprise communication apps), so they fail to evaluate realistic challenges such as ambiguous instruction understanding, cross-app coordination, and MCP (Model Context Protocol) tool invocation. To address this, we propose MobileWorld, a more challenging benchmark comprising 201 long-horizon tasks across 20 mainstream Android applications; it introduces the first agent-user interaction paradigm and an MCP-augmented task taxonomy. We build a snapshot-based, containerized Android environment with task callback APIs, database state validation, and hybrid GUI/natural-language/API action interfaces, and we design a planner-executor framework supporting both conversational user interaction and MCP tool calling. Experiments reveal that the best-performing agentic framework achieves only a 51.7% success rate, while end-to-end models attain merely 20.9%, exposing fundamental bottlenecks in user intent modeling and tool orchestration. MobileWorld thus establishes a new standard and roadmap for advancing mobile agent research.
📝 Abstract
Among existing online mobile-use benchmarks, AndroidWorld has emerged as the dominant benchmark due to its reproducible environment and deterministic evaluation; however, recent agents achieving over 90% success rates indicate its saturation and motivate the need for a more challenging benchmark. In addition, its environment lacks key application categories, such as e-commerce and enterprise communication, and does not reflect realistic mobile-use scenarios characterized by vague user instructions and hybrid tool usage. To bridge this gap, we introduce MobileWorld, a substantially more challenging benchmark designed to better reflect real-world mobile usage, comprising 201 tasks across 20 applications while maintaining the same level of reproducible evaluation as AndroidWorld. The difficulty of MobileWorld is twofold. First, it emphasizes long-horizon tasks with cross-application interactions: MobileWorld requires nearly twice as many task-completion steps on average (27.8 vs. 14.3) and includes far more multi-application tasks (62.2% vs. 9.5%) than AndroidWorld. Second, MobileWorld extends beyond standard GUI manipulation by introducing novel task categories, including agent-user interaction and MCP-augmented tasks. To ensure robust evaluation, we provide a snapshot-based container environment and precise functional verification, including backend database inspection and task callback APIs. We further develop a planner-executor agentic framework with an extended action space to support user interactions and MCP calls. Our results reveal a sharp performance drop relative to AndroidWorld, with the best agentic framework and the best end-to-end model achieving 51.7% and 20.9% success rates, respectively. Our analysis shows that current models struggle significantly with user interaction and MCP calls, and it offers a strategic roadmap toward more robust, next-generation mobile intelligence.
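To make the planner-executor framework with an extended action space concrete, here is a minimal sketch of such a loop. All names (`Action`, `run_episode`, `toy_planner`, the `"ask_user"`/`"mcp_call"` action kinds) are illustrative assumptions, not the paper's actual API; the point is only the dispatch pattern in which a planner proposes the next action and an executor routes it to the GUI, the user, or an MCP tool.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Action:
    kind: str      # one of: "gui", "ask_user", "mcp_call", "done" (hypothetical)
    payload: dict  # action-specific arguments

def run_episode(plan_next: Callable[[list], Action],
                gui_exec: Callable[[dict], str],
                ask_user: Callable[[dict], str],
                mcp_call: Callable[[dict], str],
                max_steps: int = 50) -> List[Tuple[Action, str]]:
    """Planner-executor loop: the planner proposes the next action from the
    interaction history; the executor dispatches it to the right interface."""
    history: List[Tuple[Action, str]] = []
    for _ in range(max_steps):
        action = plan_next(history)
        if action.kind == "done":
            break
        handler = {"gui": gui_exec,
                   "ask_user": ask_user,
                   "mcp_call": mcp_call}[action.kind]
        history.append((action, handler(action.payload)))
    return history

# Toy planner: first clarify intent with the user, then invoke one MCP tool.
def toy_planner(history):
    if not history:
        return Action("ask_user", {"question": "Which calendar should I use?"})
    if len(history) == 1:
        return Action("mcp_call", {"tool": "calendar.create_event",
                                   "args": {"title": "Sync"}})
    return Action("done", {})

trace = run_episode(toy_planner,
                    gui_exec=lambda p: "ok",
                    ask_user=lambda p: "Work calendar",
                    mcp_call=lambda p: f"called {p['tool']}")
print(len(trace))  # → 2 (one user exchange, one MCP call)
```

The dispatch table is what "extended action space" amounts to operationally: user questions and MCP calls are handled by the same loop as GUI actions, so the planner can interleave them freely.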
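The backend database inspection used for functional verification can be sketched as follows. This is an assumed simplification: a real container environment would query the app's on-device database (e.g., pulled from the emulator), whereas here a local SQLite file stands in for the post-task app state; `verify_event_created` and the `events` schema are hypothetical.

```python
import os
import sqlite3
import tempfile

def verify_event_created(db_path: str, title: str) -> bool:
    """Pass the task iff the app's backend DB contains an event with `title`."""
    con = sqlite3.connect(db_path)
    try:
        row = con.execute(
            "SELECT COUNT(*) FROM events WHERE title = ?", (title,)
        ).fetchone()
        return row[0] > 0
    finally:
        con.close()

# Simulate the app state after the agent has (supposedly) finished the task.
db = os.path.join(tempfile.mkdtemp(), "app.db")
con = sqlite3.connect(db)
con.execute("CREATE TABLE events (title TEXT)")
con.execute("INSERT INTO events VALUES ('Sync')")
con.commit()
con.close()

print(verify_event_created(db, "Sync"))  # → True
```

Checking the database state directly, rather than the screen, is what makes the verification deterministic: the same final state always yields the same pass/fail outcome regardless of how the GUI rendered along the way.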