MobileBench-OL: A Comprehensive Chinese Benchmark for Evaluating Mobile GUI Agents in Real-World Environment

📅 2026-01-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks for mobile GUI agents inadequately account for complex reasoning, environmental exploration, and real-world noise, limiting their ability to reflect true performance in practical scenarios. To address this gap, this work proposes MobileBench-OL, the first online evaluation benchmark tailored for Chinese mobile applications, comprising 80 apps and 1,080 tasks. It systematically assesses agent capabilities across five subsets targeting task execution, complex reasoning, exploratory behavior, and robustness to noise. Built on real Android devices, the benchmark integrates automated task scheduling, state-reset mechanisms, multidimensional evaluation metrics, and human validation. Evaluations of 12 state-of-the-art agents reveal significant performance deficiencies in realistic settings, underscoring the need for a comprehensive and rigorous evaluation framework such as MobileBench-OL.

📝 Abstract
Recent advances in mobile Graphical User Interface (GUI) agents highlight the growing need for comprehensive evaluation benchmarks. While new online benchmarks offer more realistic testing than offline ones, they tend to focus on agents' instruction-following ability while neglecting their reasoning and exploration abilities. Moreover, these benchmarks do not account for the random noise present in real-world mobile environments, leaving a gap between benchmarks and real-world conditions. To address these limitations, we propose MobileBench-OL, an online benchmark with 1,080 tasks drawn from 80 Chinese apps. It measures agents' task execution, complex reasoning, and noise robustness through five subsets, each defining multiple evaluation dimensions. We also provide an auto-eval framework with a state-reset mechanism, enabling stable and repeatable real-world benchmarking. Evaluating 12 leading GUI agents on MobileBench-OL shows significant room for improvement before they can meet real-world requirements. Human evaluation further confirms that MobileBench-OL reliably measures the performance of leading GUI agents in real environments. Our data and code will be released upon acceptance.
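
The abstract's mention of an auto-eval framework with a state-reset mechanism suggests a per-episode reset loop on real devices. Below is a minimal sketch of what such a harness might look like; the paper's actual implementation is not public, so the `Task`, agent, and judge interfaces here are purely hypothetical assumptions, while the `adb` commands (`am force-stop`, `pm clear`) are real Android tooling.

```python
# Hypothetical sketch of a reset-based online evaluation loop.
# Not the paper's framework: agent/judge interfaces are illustrative.
import subprocess
from dataclasses import dataclass

@dataclass
class Task:
    app_package: str   # Android package name of the app under test
    instruction: str   # natural-language task description
    max_steps: int = 30

def reset_app_state(task: Task) -> None:
    """Force-stop and clear the app so every episode starts from a clean state."""
    subprocess.run(["adb", "shell", "am", "force-stop", task.app_package], check=True)
    subprocess.run(["adb", "shell", "pm", "clear", task.app_package], check=True)

def run_episode(agent, task: Task, judge) -> bool:
    """Run one task episode and return whether the judge marks it successful."""
    reset_app_state(task)
    observation = agent.observe()  # e.g., a screenshot plus UI tree (assumed API)
    for _ in range(task.max_steps):
        action = agent.act(task.instruction, observation)
        if action.is_terminal:
            break
        observation = agent.execute(action)
    return judge.score(task, agent.trajectory())

def evaluate(agent, tasks, judge) -> float:
    """Task success rate; one reset per episode keeps runs stable and repeatable."""
    successes = sum(run_episode(agent, t, judge) for t in tasks)
    return successes / len(tasks)
```

Resetting before each episode, rather than relying on app state carried over from previous tasks, is what makes repeated runs comparable across agents on a shared physical device.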
Problem

Research questions and friction points this paper is trying to address.

mobile GUI agents
evaluation benchmark
reasoning ability
exploration ability
real-world noise
Innovation

Methods, ideas, or system contributions that make the work stand out.

mobile GUI agents
online benchmark
noise robustness
complex reasoning
auto-eval framework