AndroidLens: Long-latency Evaluation with Nested Sub-targets for Android GUI Agents

📅 2025-12-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing GUI agent evaluation benchmarks suffer from narrow task coverage, low complexity, and coarse-grained metrics, failing to capture agents’ true capabilities in long-horizon, multi-step tasks. This paper introduces the first long-horizon evaluation framework for Android GUI agents, comprising 571 bilingual (Chinese–English) real-world tasks across 38 domains (avg. 26+ steps). It proposes a nested sub-goal-driven paradigm and a fine-grained metric—Average Task Progress (ATP)—to quantify incremental progress. The framework features a dual-mode architecture: static anomaly-preserving evaluation and dynamic milestone-based progress measurement, integrating GUI state graph modeling, automated milestone annotation, multi-path validation, and human-in-the-loop verification. Experiments reveal that state-of-the-art models achieve only 12.7% task success rate and 50.47% ATP, highlighting three fundamental bottlenecks: robustness to environmental anomalies, adaptive exploration, and long-range memory retention.

📝 Abstract
Graphical user interface (GUI) agents can substantially improve productivity by automating frequently executed long-latency tasks on mobile devices. However, existing evaluation benchmarks are still constrained to limited applications, simple tasks, and coarse-grained metrics. To address this, we introduce AndroidLens, a challenging evaluation framework for mobile GUI agents, comprising 571 long-latency tasks in both Chinese and English environments, each requiring an average of more than 26 steps to complete. The framework features: (1) tasks derived from real-world user scenarios across 38 domains, covering complex types such as multi-constraint, multi-goal, and domain-specific tasks; (2) static evaluation that preserves real-world anomalies and allows multiple valid paths to reduce bias; and (3) dynamic evaluation that employs a milestone-based scheme for fine-grained progress measurement via Average Task Progress (ATP). Our evaluation indicates that even the best models reach only a 12.7% task success rate and 50.47% ATP. We also underscore key challenges in real-world environments, including environmental anomalies, adaptive exploration, and long-term memory retention.
Problem

Research questions and friction points this paper is trying to address.

Evaluates mobile GUI agents on complex long-latency tasks
Addresses limitations in existing benchmarks for Android automation
Measures fine-grained progress in real-world user scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces AndroidLens evaluation framework with 571 long-latency tasks
Features static evaluation preserving real-world anomalies and multiple valid paths
Uses dynamic milestone-based scheme with Average Task Progress for measurement
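The paper's scoring code is not reproduced on this page; as a rough illustration only (the data shapes, function names, and equal milestone weighting below are assumptions, not the authors' implementation), a milestone-based ATP can be sketched as the fraction of annotated milestones an agent reaches per task, averaged across the benchmark:

```python
from dataclasses import dataclass

@dataclass
class TaskRun:
    """One agent run: milestones annotated for the task, and those reached."""
    milestones: list[str]  # ordered sub-goals annotated for the task
    reached: set[str]      # milestones the agent actually hit during the run

def task_progress(run: TaskRun) -> float:
    """Fraction of this task's milestones the agent reached (0.0 to 1.0)."""
    if not run.milestones:
        return 0.0
    hit = sum(1 for m in run.milestones if m in run.reached)
    return hit / len(run.milestones)

def average_task_progress(runs: list[TaskRun]) -> float:
    """Mean per-task progress across all benchmark runs, as a percentage."""
    if not runs:
        return 0.0
    return 100.0 * sum(task_progress(r) for r in runs) / len(runs)

runs = [
    TaskRun(["open_app", "search_item", "apply_filter", "checkout"],
            {"open_app", "search_item"}),                       # 50% progress
    TaskRun(["open_app", "set_alarm"], {"open_app", "set_alarm"}),  # 100%
]
print(average_task_progress(runs))  # 75.0
```

Unlike a binary success rate, this kind of metric credits partial completion, which is what lets the benchmark separate a 12.7% success rate from a 50.47% ATP.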
Yue Cao
Nanjing University
Yingyao Wang
Alibaba Group, Harbin Institute of Technology
Pi Bu
Alibaba Group
Jingxuan Xing
Alibaba Group
Wei Jiang
Alibaba Group
Zekun Zhu
Alibaba Group
Junpeng Ma
Fudan University, Alibaba Group
Sashuai Zhou
Zhejiang University, Alibaba Group
Tong Lu
Nanjing University
Jun Song
Shenzhen University
Yu Cheng
Alibaba Group
Yuning Jiang
Alibaba Group
Bo Zheng
Alibaba Group