A3: Android Agent Arena for Mobile GUI Agents

📅 2025-01-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing evaluation methods rely on static screenshots, which limits their ability to assess an AI assistant's capacity to solve complex GUI tasks in realistic mobile environments: they offer neither automation nor ecological validity. To address this, we propose A3, an Android agent evaluation platform. Methodologically, A3 (1) establishes a dynamic, interactive benchmark covering 21 mainstream mobile applications and 201 representative user tasks; (2) introduces the first business-level, LLM-driven automated evaluation framework, enabling cross-app task modeling, real-time information retrieval, and generalization across action spaces; and (3) supports end-to-end evaluation with zero coding and minimal human effort. Empirically, A3 substantially improves the reproducibility, cross-platform compatibility, and practical performance validation of mobile GUI agents under realistic operating conditions.
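
As a purely illustrative sketch of what generalizing across action spaces can look like in practice, the snippet below normalizes an agent's native output into one device-executable action format. All class, field, and function names here are assumptions made for this example, not A3's actual schema:

```python
# Illustrative only: the names below are assumptions, not A3's actual schema.
# The idea: agents trained on different datasets emit heterogeneous actions,
# so a benchmark can translate them into one normalized, executable form.
from dataclasses import dataclass
from enum import Enum


class ActionType(Enum):
    CLICK = "click"
    LONG_PRESS = "long_press"
    SWIPE = "swipe"
    TYPE = "type"
    NAVIGATE_BACK = "navigate_back"
    NAVIGATE_HOME = "navigate_home"
    WAIT = "wait"


@dataclass
class Action:
    """A device-level action normalized from an agent's native output."""
    action_type: ActionType
    # Coordinates as fractions of screen width/height, so one representation
    # works across devices with different resolutions.
    x: float | None = None
    y: float | None = None
    text: str | None = None       # payload for TYPE actions
    direction: str | None = None  # payload for SWIPE actions


def to_adb_command(action: Action, width: int, height: int) -> str:
    """Translate a normalized action into an `adb shell` input command."""
    if action.action_type is ActionType.CLICK:
        return f"input tap {int(action.x * width)} {int(action.y * height)}"
    if action.action_type is ActionType.TYPE:
        return f"input text {action.text!r}"
    if action.action_type is ActionType.NAVIGATE_BACK:
        return "input keyevent KEYCODE_BACK"
    raise NotImplementedError(f"no adb mapping for {action.action_type}")
```

Expressing coordinates as fractions of the screen size is one common way to keep a single action representation compatible with devices of different resolutions.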

📝 Abstract
AI agents have become increasingly prevalent in recent years, driven by significant advancements in the field of large language models (LLMs). Mobile GUI agents, a subset of AI agents, are designed to autonomously perform tasks on mobile devices. While numerous studies have introduced agents, datasets, and benchmarks to advance mobile GUI agent research, many existing datasets focus on static frame evaluations and fail to provide a comprehensive platform for assessing performance on real-world, in-the-wild tasks. To address this gap, we present Android Agent Arena (A3), a novel evaluation platform. Unlike existing in-the-wild systems, A3 offers: (1) meaningful and practical tasks, such as real-time online information retrieval and operational instructions; (2) a larger, more flexible action space, enabling compatibility with agents trained on any dataset; and (3) an automated, business-level, LLM-based evaluation process. A3 includes 21 widely used general third-party apps and 201 tasks representative of common user scenarios, providing a robust foundation for evaluating mobile GUI agents in real-world situations, along with a new autonomous evaluation process that reduces the required human labor and coding expertise. The project is available at https://yuxiangchai.github.io/Android-Agent-Arena/.
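
To make the automated, LLM-based evaluation process concrete, here is a minimal, hedged sketch of an LLM-as-judge step over a recorded episode. The prompt wording, JSON rubric, and function signatures are assumptions for illustration and are not taken from the paper:

```python
# Hedged sketch of an LLM-as-judge evaluation step; the prompt wording,
# JSON rubric, and function names are assumptions, not the paper's.
import json

JUDGE_PROMPT = """You are evaluating a mobile GUI agent.
Task: {task}
Below is the sequence of screens (summarized as text) and agent actions.
{trajectory}
Answer in JSON: {{"success": true or false, "reason": "..."}}"""


def judge_episode(llm_client, task: str, trajectory: list[dict]) -> bool:
    """Ask an LLM whether a recorded episode completed the task.

    `llm_client` is any callable mapping a prompt string to a completion
    string; each trajectory step is assumed to carry `screen_summary`
    and `action` fields (hypothetical keys).
    """
    steps = "\n".join(
        f"Step {i}: screen={t['screen_summary']!r} action={t['action']!r}"
        for i, t in enumerate(trajectory)
    )
    reply = llm_client(JUDGE_PROMPT.format(task=task, trajectory=steps))
    try:
        verdict = json.loads(reply)
        return isinstance(verdict, dict) and bool(verdict.get("success"))
    except json.JSONDecodeError:
        return False  # an unparsable verdict counts as failure
```

A wrapper like this can be pointed at any chat-completion backend, in keeping with the paper's stated goal of evaluation that demands little coding effort from the benchmark user.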
Problem

Research questions and friction points this paper is trying to address.

Mobile AI Assistants
Complex Problem Solving
Automated Evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Android Agent Arena
Real-life Problem Solving
Automated Evaluation
Yuxiang Chai
The Chinese University of Hong Kong
Computer Vision · LLM · Agent
Hanhao Li
The Chinese University of Hong Kong
Jiayu Zhang
EE Department @ CUHK
Liang Liu
vivo AI Lab
Guozhi Wang
vivo AI Lab
Shuai Ren
vivo AI Lab
Siyuan Huang
Shanghai Jiao Tong University
Hongsheng Li
MMLab @ CUHK