Modular and Multi-Path-Aware Offline Benchmarking for Mobile GUI Agents

📅 2025-12-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current mobile GUI agent evaluation suffers from two key limitations: offline benchmarks rely on single-path trajectories and penalize semantically valid alternative actions, while online benchmarks lack scalability and reproducibility and treat agents as black boxes, hindering attribution of performance bottlenecks. This paper introduces MobiBench, the first high-fidelity offline benchmark framework supporting modular evaluation and multi-path tolerance. Its contributions are threefold: (1) a module-level, multi-path-aware evaluation paradigm that decouples component testing and models action-trajectory diversity; (2) multi-path rationality scoring and analysis of the capability boundaries of current LFMs for fine-grained bottleneck diagnosis; and (3) empirical validation showing 94.72% agreement with human evaluation, on par with online benchmarks, while revealing optimal trade-offs between model scale and module configuration. MobiBench provides evidence-based guidelines for designing lightweight, efficient GUI agents.

📝 Abstract
Mobile GUI agents, AI agents capable of interacting with mobile applications on behalf of users, have the potential to transform human-computer interaction. However, current evaluation practices for GUI agents face two fundamental limitations. First, they rely on either single-path offline benchmarks or online live benchmarks. Offline benchmarks using static, single-path annotated datasets unfairly penalize valid alternative actions, while online benchmarks suffer from poor scalability and reproducibility due to the dynamic and unpredictable nature of live evaluation. Second, existing benchmarks treat agents as monolithic black boxes, overlooking the contributions of individual components, which often leads to unfair comparisons or obscures key performance bottlenecks. To address these limitations, we present MobiBench, the first modular and multi-path-aware offline benchmarking framework for mobile GUI agents that enables high-fidelity, scalable, and reproducible evaluation entirely in offline settings. Our experiments demonstrate that MobiBench achieves 94.72% agreement with human evaluators, on par with carefully engineered online benchmarks, while preserving the scalability and reproducibility of static offline benchmarks. Furthermore, our comprehensive module-level analysis uncovers several key insights, including a systematic evaluation of diverse techniques used in mobile GUI agents, optimal module configurations across model scales, the inherent limitations of current LFMs, and actionable guidelines for designing more capable and cost-efficient mobile agents.
Problem

Research questions and friction points this paper is trying to address.

Addresses the limitations of both single-path offline and online GUI agent benchmarks
Introduces modular framework for component-level analysis of mobile GUI agents
Enables scalable, reproducible offline evaluation with high human agreement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modular offline benchmarking framework
Multi-path-aware evaluation approach
Component-level analysis for performance insights
Youngmin Im
School of Computing, Korea Advanced Institute of Science and Technology
Byeongung Jo
Sungkyunkwan University, S. Korea
Jaeyoung Wi
KAIST, S. Korea
Seungwoo Baek
Sungkyunkwan University, S. Korea
Tae Hoon Min
Sungkyunkwan University, S. Korea
Joo Hyung Lee
Sungkyunkwan University, S. Korea
Sangeun Oh
Korea University
Mobile computing · Embedded systems · Real-time systems
Insik Shin
Professor, School of Computing, KAIST
Mobile computing · Systems security · Real-time systems
Sunjae Lee
Sungkyunkwan University, S. Korea