AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents

📅 2024-07-03
🏛️ arXiv.org
📈 Citations: 28
Influential: 5
🤖 AI Summary
Existing mobile GUI-controlling agents suffer from a lack of large-scale, hierarchically annotated datasets supporting multi-step interaction and semantic understanding. To address this, we propose AMEX—the first high-quality, general-purpose dataset for mobile GUI control—comprising 104K high-resolution Android UI screenshots with three-tier collaborative annotations: (1) GUI element localization, (2) screen-level functional semantic parsing, and (3) alignment of natural language instructions with executable action sequences. This hierarchical annotation paradigm overcomes the limitations of conventional single-layer labeling, substantially improving task interpretability and cross-application generalization. Data are collected from real-world apps and meticulously annotated by human experts, integrating visual grounding, semantic parsing, and instruction–action alignment techniques. Fine-tuning the SPHINX Agent on AMEX demonstrates significant performance gains: our agent achieves markedly higher cross-app task completion rates and instruction-following accuracy compared to baselines. The AMEX dataset is publicly released and has been widely adopted by the research community.

📝 Abstract
AI agents have drawn increasing attention, mostly for their ability to perceive environments, understand tasks, and autonomously achieve goals. To advance research on AI agents in mobile scenarios, we introduce the Android Multi-annotation EXpo (AMEX), a comprehensive, large-scale dataset designed for generalist mobile GUI-control agents, which are capable of completing tasks by directly interacting with the graphical user interface (GUI) on mobile devices. AMEX comprises over 104K high-resolution screenshots from popular mobile applications, annotated at multiple levels. Unlike existing GUI-related datasets, e.g., Rico, AitW, etc., AMEX includes three levels of annotations: GUI interactive element grounding, GUI screen and element functionality descriptions, and complex natural language instructions with stepwise GUI-action chains. We develop this dataset from a more instructive and detailed perspective, complementing the general settings of existing datasets. Additionally, we finetune a baseline model, SPHINX Agent, and illustrate the effectiveness of AMEX. The project is available at https://yxchai.com/AMEX/.
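The three annotation levels the abstract describes (interactive element grounding, screen/element functionality descriptions, and instruction–action chains) can be pictured as one record per screenshot. Below is a minimal sketch in Python; all class and field names are illustrative assumptions for exposition, not the actual AMEX schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical schema sketch; the real AMEX annotation format may differ.

@dataclass
class ElementGrounding:               # level 1: interactive element localization
    bbox: Tuple[int, int, int, int]   # (x1, y1, x2, y2) in pixels
    element_type: str                 # e.g. "button", "text_field"

@dataclass
class Functionality:                  # level 2: screen/element functionality description
    target: str                       # "screen" or an element identifier
    description: str                  # natural-language functionality

@dataclass
class ActionStep:                     # one step of a level-3 GUI-action chain
    action: str                       # e.g. "tap", "type", "scroll"
    argument: str                     # coordinates or text to enter

@dataclass
class ScreenAnnotation:               # one annotated screenshot
    screenshot_path: str
    elements: List[ElementGrounding] = field(default_factory=list)
    functions: List[Functionality] = field(default_factory=list)
    instruction: str = ""             # complex natural-language task
    action_chain: List[ActionStep] = field(default_factory=list)

# Example record tying the three levels together
ann = ScreenAnnotation(
    screenshot_path="screens/0001.png",
    elements=[ElementGrounding((120, 300, 480, 360), "button")],
    functions=[Functionality("screen", "Search results page of a shopping app")],
    instruction="Add the first search result to the cart",
    action_chain=[ActionStep("tap", "(300, 330)")],
)
print(len(ann.elements), ann.action_chain[0].action)  # → 1 tap
```

The point of the nesting is that a single screenshot carries all three tiers at once, which is what distinguishes AMEX's annotation paradigm from single-layer datasets such as Rico or AitW.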
Problem

Research questions and friction points this paper is trying to address.

Develop a dataset for mobile GUI-control AI agents
Provide multi-level annotations for mobile GUI interactions
Enhance AI agent capabilities in mobile task completion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale multi-annotation mobile GUI dataset
Three-level detailed GUI and action annotations
Fine-tuned SPHINX Agent for GUI tasks
Yuxiang Chai
The Chinese University of Hong Kong
Computer Vision · LLM · Agent
Siyuan Huang
SJTU, Shanghai AI Lab
Yazhe Niu
MMLab, CUHK, Shanghai AI Lab
Han Xiao
MMLab, CUHK
Liang Liu
vivo AI Lab
Dingyu Zhang
MMLab, CUHK
Peng Gao
Shanghai AI Lab
Shuai Ren
vivo AI Lab
Hongsheng Li
MMLab, CUHK