Mobile-Bench-v2: A More Realistic and Comprehensive Benchmark for VLM-based Mobile Agents

📅 2025-05-17
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing GUI benchmarks suffer from three critical limitations: unstable reward signals in online evaluation, offline evaluation that relies on single-path trajectories and therefore ignores the multi-solution nature of GUI tasks, and no assessment of noise robustness or proactive interaction capability. To address these gaps, we introduce Mobile-Bench-v2, a more realistic and comprehensive benchmark for VLM-based mobile agents. Its key contributions are: (1) multi-path offline evaluation to capture the diversity of valid GUI trajectories; (2) systematic injection of realistic disturbances, including pop-ups, ads, and the contaminated AITZ-Noise split; (3) an ambiguous-instruction split with preset Q&A interactions to evaluate proactive interaction; and (4) high-quality data construction via slot-based instruction generation and GUI trajectory sampling. Evaluations of leading frameworks, including the single-agent AppAgent-v1, the multi-agent Mobile-Agent-v2, UI-Tars, and OS-Atlas, reveal substantial deficiencies in current agents' noise resilience and proactive reasoning. The benchmark code and data are publicly released.
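To make the slot-based instruction generation concrete, here is a minimal sketch; the slot names, slot values, and template strings are illustrative assumptions, not the paper's actual templates or data:

```python
import random

# Hypothetical slot values; the paper builds these from real app and GUI metadata.
SLOTS = {
    "app": ["Spotify", "Booking", "Gmail"],
    "action": ["open", "search for", "share"],
    "entity": ["a jazz playlist", "a hotel in Berlin", "the latest unread email"],
}

TEMPLATES = [
    "In {app}, {action} {entity}.",
    "Use {app} to {action} {entity}.",
]

def generate_instruction(rng: random.Random) -> str:
    """Fill one randomly chosen template with randomly sampled slot values."""
    template = rng.choice(TEMPLATES)
    values = {name: rng.choice(options) for name, options in SLOTS.items()}
    return template.format(**values)

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(generate_instruction(rng))
```

The appeal of the slot-based approach is that a small number of templates, filled with slots drawn from real app metadata, can yield many concrete, executable instructions.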

📝 Abstract
VLM-based mobile agents are increasingly popular due to their ability to interact with smartphone GUIs and XML-structured text and to complete daily tasks. However, existing online benchmarks struggle to obtain stable reward signals due to dynamic environmental changes. Offline benchmarks evaluate agents through single-path trajectories, which stands in contrast to the inherently multi-solution nature of GUI tasks. Additionally, both types of benchmarks fail to assess whether mobile agents can handle noise or engage in proactive interactions, because the evaluation process lacks noisy apps and relies on overly complete instructions. To address these limitations, we use a slot-based instruction generation method to construct a more realistic and comprehensive benchmark named Mobile-Bench-v2. Mobile-Bench-v2 includes a common task split with offline multi-path evaluation to assess the agent's ability to obtain step rewards during task execution. It contains a noisy split based on apps with pop-ups and ads, and a contaminated split named AITZ-Noise, to form a realistic noisy environment. Furthermore, an ambiguous-instruction split with preset Q&A interactions is released to evaluate the agent's proactive interaction capabilities. We conduct evaluations on these splits using the single-agent framework AppAgent-v1, the multi-agent framework Mobile-Agent-v2, and other mobile agents such as UI-Tars and OS-Atlas. Code and data are available at https://huggingface.co/datasets/xwk123/MobileBench-v2.
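For the ambiguous-instruction split with preset Q&A, one way to picture the interaction loop is sketched below; the token-overlap matching rule and the example instruction and answers are assumptions for illustration, not the benchmark's exact protocol:

```python
from typing import Dict, Optional

def answer_from_presets(question: str, preset_qa: Dict[str, str]) -> Optional[str]:
    """Return the preset answer whose stored question best overlaps the agent's question."""
    q_tokens = set(question.lower().split())
    best_answer, best_overlap = None, 0
    for preset_q, preset_a in preset_qa.items():
        overlap = len(q_tokens & set(preset_q.lower().split()))
        if overlap > best_overlap:
            best_answer, best_overlap = preset_a, overlap
    return best_answer

# Example: a fuzzy instruction plus the preset clarification the evaluator can reveal
# if (and only if) the agent proactively asks about it.
instruction = "Book a table for dinner."
preset_qa = {"How many people is the reservation for?": "Four people at 7 pm."}

agent_question = "How many people should I book for?"
print(answer_from_presets(agent_question, preset_qa))  # "Four people at 7 pm."
```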
Problem

Research questions and friction points this paper is trying to address.

Online benchmarks lack stable reward signals due to dynamic environment changes
Offline benchmarks ignore the multi-solution nature of GUI tasks
Current evaluations fail to test noise handling and proactive interactions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Slot-based instruction generation for realistic benchmarks
Offline multi-path evaluation with step rewards (see the sketch after this list)
Noisy splits for robustness and an ambiguous-instruction split for proactive interaction
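A minimal sketch of how multi-path step rewards could be scored follows; the `Action` fields, the exact-match rule, and the per-step averaging are assumptions for illustration rather than the benchmark's exact metric:

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Action:
    """A single GUI step: an action type plus the element it operates on."""
    kind: str      # e.g. "tap", "type", "scroll"
    target: str    # e.g. a resource id or element description

def step_reward(predicted: Action, reference_paths: List[List[Action]], step: int) -> float:
    """Return 1.0 if the predicted action matches ANY annotated reference path at this step."""
    for path in reference_paths:
        if step < len(path) and predicted == path[step]:
            return 1.0
    return 0.0

def trajectory_score(predicted: List[Action], reference_paths: List[List[Action]]) -> float:
    """Average per-step reward over the predicted trajectory (multi-path matching)."""
    if not predicted:
        return 0.0
    rewards = [step_reward(a, reference_paths, i) for i, a in enumerate(predicted)]
    return sum(rewards) / len(rewards)

# Example: two valid reference paths for the same task; the agent follows the second one.
paths = [
    [Action("tap", "search_bar"), Action("type", "jazz"), Action("tap", "result_0")],
    [Action("tap", "browse_tab"), Action("tap", "genre_jazz")],
]
agent = [Action("tap", "browse_tab"), Action("tap", "genre_jazz")]
print(trajectory_score(agent, paths))  # 1.0
```

Matching against any of several annotated reference paths is what distinguishes this from single-path offline evaluation, where a correct but different route would be penalized.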