🤖 AI Summary
Existing mobile device-control agents lack a standardized benchmark, hindering rigorous performance evaluation and cross-method comparison. This paper introduces B-MoCA, a benchmark for mobile agents built on the real Android operating system, comprising 131 everyday tasks. It employs multi-dimensional randomization of device configurations, including UI layouts and language settings, to systematically assess generalization. Its key contributions are: (1) an interactive benchmarking framework that supports generalization across device configurations; (2) a systematic robustness evaluation of LLM-based, multimodal-LLM-based, and imitation-learning agents in a realistic system-level environment; and (3) fully open-sourced code, execution environments, and human expert demonstration data. Experiments show that state-of-the-art methods achieve roughly 78% success on simple tasks but fall below 22% on complex ones, exposing critical bottlenecks in reasoning, grounding, and long-horizon planning. This work establishes a reproducible, scalable evaluation paradigm for mobile agents.
📝 Abstract
Mobile device control agents can greatly enhance user interactions and productivity by automating daily tasks. However, despite growing interest in developing practical agents, the absence of a commonly adopted benchmark in this area makes it challenging to quantify scientific progress. In this work, we introduce B-MoCA: a novel benchmark with interactive environments for evaluating and developing mobile device control agents. To create a realistic benchmark, we build B-MoCA on the Android operating system and define 131 common daily tasks. Importantly, we incorporate a randomization feature that changes the configurations of mobile devices, including user interface layouts and language settings, to assess generalization performance. We benchmark diverse agents, including agents employing large language models (LLMs) or multi-modal LLMs, as well as agents trained with imitation learning from human expert demonstrations. While these agents demonstrate proficiency in executing straightforward tasks, their poor performance on complex tasks highlights significant opportunities for future research to improve their effectiveness. Our source code is publicly available at https://b-moca.github.io.
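To make the evaluation setup concrete, the loop below is a minimal, hypothetical sketch of how an interactive benchmark with randomized device configurations might be driven. None of the class or method names come from the actual B-MoCA codebase; `StubDeviceEnv` is a toy stand-in for a real Android environment, and the randomized `config` dict mimics the paper's UI-layout and language-setting randomization.

```python
import random

class StubDeviceEnv:
    """Toy stand-in for an Android device environment (hypothetical API)."""

    def __init__(self, seed):
        rng = random.Random(seed)
        # Randomized device configuration, mimicking B-MoCA's randomization
        # of UI layouts and language settings across evaluation episodes.
        self.config = {
            "icon_layout": rng.choice(["grid_4x5", "grid_5x6"]),
            "language": rng.choice(["en", "ko", "es"]),
        }
        self.steps = 0

    def reset(self, task):
        self.task = task
        self.steps = 0
        return {"screen": "home", "config": self.config}

    def step(self, action):
        # A single correct action completes this toy task; otherwise the
        # episode times out after a fixed step budget.
        self.steps += 1
        done = action == "tap_target" or self.steps >= 5
        success = action == "tap_target"
        return {"screen": "home"}, done, success

def run_episode(env, policy, task):
    """Roll out one task episode and report task success."""
    obs = env.reset(task)
    while True:
        action = policy(obs)
        obs, done, success = env.step(action)
        if done:
            return success

# A trivial scripted policy standing in for an LLM or imitation-learned agent.
success = run_episode(StubDeviceEnv(seed=0), lambda obs: "tap_target", "open_app")
print(success)  # → True
```

In the real benchmark, the observation would be a device screenshot or UI tree and the action space would cover gestures and text input; the sketch only illustrates the episode structure and where configuration randomization enters.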