🤖 AI Summary
Existing evaluation benchmarks for mobile GUI agents struggle to cover complex real-world applications and long-horizon interactions, and lack effective reward mechanisms in closed-source settings. To address these limitations, this work proposes SimuWoB, a fully synthetic, high-fidelity benchmark comprising 120 diverse tasks spanning multiple types and difficulty levels. SimuWoB uses backend-free web pages to enable lightweight, reproducible evaluation; its core innovation is the automatic generation of virtual environments and task rewards. This approach supports comprehensive assessment of complex, long-horizon interaction capabilities without requiring access to real applications. Experimental results show that state-of-the-art agents achieve only a 27.92% average success rate on SimuWoB, dropping to 17.82% on long-horizon tasks, and a comparison with real-world sample tasks indicates that the evaluation outcomes generalize to real-world scenarios.
📝 Abstract
Mobile GUI agents powered by large language models have progressed rapidly, creating an urgent need for realistic and comprehensive evaluation. Existing benchmarks prioritize reproducibility but are often limited to open-source apps or file-operation tasks because constructing rewards on real applications is difficult, leaving a gap between benchmark settings and real-world usage. Moreover, most benchmarks focus on basic grounding and navigation, with limited coverage of complex, long-horizon interactions. To address these limitations, we introduce SimuWoB, a fully synthetic benchmark for mobile GUI agents with 120 challenging tasks spanning diverse types and difficulty levels. We build a robust virtual environment generation framework that synthesizes high-fidelity tasks and environments and automatically provides valid rewards for each task. Each environment is deployed as a backend-free webpage accessible via URL, enabling efficient and reproducible evaluation. We conduct comprehensive experiments on several state-of-the-art mobile GUI agents. The average success rate is only 27.92%, dropping to 17.82% on long-horizon tasks, which reveals substantial weaknesses in current agents under complex scenarios. A comparison of evaluation results with those on real-world sample tasks demonstrates that agent assessments based on our synthetic environment generalize well. We further provide diagnostic insights across key capability dimensions and discuss implications for future mobile GUI agent development.