A3: Android Agent Arena for Mobile GUI Agents

📅 2025-01-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing evaluation methods rely on static screenshots, which limits their ability to assess an AI assistant's capacity to solve complex GUI tasks in realistic mobile environments: they offer neither automation nor ecological validity. To address this, we propose A3, an Android agent evaluation platform. Methodologically, A3 (1) establishes a dynamic, interactive benchmark covering 21 mainstream mobile applications and 201 representative user tasks; (2) introduces the first business-level, LLM-driven automated evaluation framework, enabling cross-app task modeling, real-time information retrieval, and generalization across action spaces; and (3) supports end-to-end evaluation with zero coding and minimal human effort. Empirically, A3 substantially improves the reproducibility, cross-platform compatibility, and practical performance validation of mobile GUI agents under realistic operating conditions.
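
As a purely illustrative sketch of what generalizing across action spaces can look like in practice, the snippet below normalizes an agent's native output into one device-executable action format. All class, field, and function names here are assumptions made for this example, not A3's actual schema:

```python
# Illustrative only: the names below are assumptions, not A3's actual schema.
# The idea: agents trained on different datasets emit heterogeneous actions,
# so a benchmark can translate them into one normalized, executable form.
from dataclasses import dataclass
from enum import Enum


class ActionType(Enum):
    CLICK = "click"
    LONG_PRESS = "long_press"
    SWIPE = "swipe"
    TYPE = "type"
    NAVIGATE_BACK = "navigate_back"
    NAVIGATE_HOME = "navigate_home"
    WAIT = "wait"


@dataclass
class Action:
    """A device-level action normalized from an agent's native output."""
    action_type: ActionType
    # Coordinates as fractions of screen width/height, so one representation
    # works across devices with different resolutions.
    x: float | None = None
    y: float | None = None
    text: str | None = None       # payload for TYPE actions
    direction: str | None = None  # payload for SWIPE actions


def to_adb_command(action: Action, width: int, height: int) -> str:
    """Translate a normalized action into an `adb shell` input command."""
    if action.action_type is ActionType.CLICK:
        return f"input tap {int(action.x * width)} {int(action.y * height)}"
    if action.action_type is ActionType.TYPE:
        return f"input text {action.text!r}"
    if action.action_type is ActionType.NAVIGATE_BACK:
        return "input keyevent KEYCODE_BACK"
    raise NotImplementedError(f"no adb mapping for {action.action_type}")
```

Expressing coordinates as fractions of the screen size is one common way to keep a single action representation compatible with devices of different resolutions.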

📝 Abstract
AI agents have become increasingly prevalent in recent years, driven by significant advancements in the field of large language models (LLMs). Mobile GUI agents, a subset of AI agents, are designed to autonomously perform tasks on mobile devices. While numerous studies have introduced agents, datasets, and benchmarks to advance mobile GUI agent research, many existing datasets focus on static frame evaluations and fail to provide a comprehensive platform for assessing performance on real-world, in-the-wild tasks. To address this gap, we present Android Agent Arena (A3), a novel evaluation platform. Unlike existing in-the-wild systems, A3 offers: (1) meaningful and practical tasks, such as real-time online information retrieval and operational instructions; (2) a larger, more flexible action space, enabling compatibility with agents trained on any dataset; and (3) an automated, business-level, LLM-based evaluation process. A3 includes 21 widely used general third-party apps and 201 tasks representative of common user scenarios, providing a robust foundation for evaluating mobile GUI agents in real-world situations, along with a new autonomous evaluation process that reduces the required human labor and coding expertise. The project is available at https://yuxiangchai.github.io/Android-Agent-Arena/.
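
To make the automated, LLM-based evaluation process concrete, here is a minimal, hedged sketch of an LLM-as-judge step over a recorded episode. The prompt wording, JSON rubric, and function signatures are assumptions for illustration and are not taken from the paper:

```python
# Hedged sketch of an LLM-as-judge evaluation step; the prompt wording,
# JSON rubric, and function names are assumptions, not the paper's.
import json

JUDGE_PROMPT = """You are evaluating a mobile GUI agent.
Task: {task}
Below is the sequence of screens (summarized as text) and agent actions.
{trajectory}
Answer in JSON: {{"success": true or false, "reason": "..."}}"""


def judge_episode(llm_client, task: str, trajectory: list[dict]) -> bool:
    """Ask an LLM whether a recorded episode completed the task.

    `llm_client` is any callable mapping a prompt string to a completion
    string; each trajectory step is assumed to carry `screen_summary`
    and `action` fields (hypothetical keys).
    """
    steps = "\n".join(
        f"Step {i}: screen={t['screen_summary']!r} action={t['action']!r}"
        for i, t in enumerate(trajectory)
    )
    reply = llm_client(JUDGE_PROMPT.format(task=task, trajectory=steps))
    try:
        verdict = json.loads(reply)
        return isinstance(verdict, dict) and bool(verdict.get("success"))
    except json.JSONDecodeError:
        return False  # an unparsable verdict counts as failure
```

A wrapper like this can be pointed at any chat-completion backend, in keeping with the paper's stated goal of evaluation that demands little coding effort from the benchmark user.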
Problem

Research questions and friction points this paper is trying to address.

Mobile AI Assistants
Complex Problem Solving
Automated Evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Android Agent Arena
Real-life Problem Solving
Automated Evaluation
Yuxiang Chai
The Chinese University of Hong Kong
Computer Vision · LLM · Agent
Hanhao Li
The Chinese University of Hong Kong
Jiayu Zhang
EE Department @ CUHK
Liang Liu
vivo AI Lab
Guozhi Wang
vivo AI Lab
Shuai Ren
vivo AI Lab
Siyuan Huang
Shanghai Jiao Tong University
Hongsheng Li
MMLab @ CUHK