AndroidDaily: A Verifiable Benchmark for Mobile GUI Agents on Real-World Closed-Source Applications

📅 2026-05-26
🤖 AI Summary
This work addresses the challenge of evaluating mobile GUI agents on real-world, closed-source applications, where existing benchmarks built on simulated or open-source environments cannot provide automatic, verifiable assessment. To bridge this gap, the authors introduce AndroidDaily, a large-scale benchmark comprising 350 everyday tasks across 94 high-frequency Android applications, along with GRADE, a novel evaluation framework. GRADE performs process-aware, automatic diagnosis of long-horizon interaction trajectories in closed-source apps without requiring access to internal states, by checking three externally observable criteria: operational obligations, output quality, and negative constraints. Experiments show that GRADE achieves 87.37% agreement with human evaluators, while the strongest evaluated model attains only a 62.0% task success rate on AndroidDaily, revealing substantial limitations of current approaches in realistic settings.
📝 Abstract
The rapid development of GUI foundation models and mobile GUI agents has spurred numerous evaluation benchmarks, yet most rely on simulated environments or open-source applications, leaving real-world closed-source applications largely unevaluated. The core difficulty is that closed-source applications do not expose internal states, making traditional automatic verification inapplicable. To bridge this gap, we introduce AndroidDaily, a large-scale benchmark comprising 350 realistic daily-use tasks across 94 high-frequency Android applications spanning transportation, shopping, local services, entertainment, content creation, social media, and everyday utilities. To enable automatic and verifiable assessment in these opaque environments, we propose Guideline-grounded Reviewer for Automatic Diagnostic Evaluation (GRADE), a process-aware evaluator built on a three-tiered system of observable external guidelines: operational obligations, output quality, and negative constraints. GRADE tracks the agent's visual trajectory against these criteria and produces step-level diagnostic judgments, turning long-horizon, open-ended mobile interactions into verifiable evaluation without relying on hidden internal states. Experiments show that GRADE achieves 87.37% agreement with human evaluators. The strongest model reaches a 62.0% success rate on AndroidDaily, highlighting a substantial gap between current reasoning capabilities and practical execution in realistic mobile workflows.
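As a rough illustration of the three guideline tiers and the step-level judgments described in the abstract, the Python sketch below shows one way such judgments could be represented and rolled up into a task verdict, and how agreement with human labels could be computed. It is not the paper's implementation: the names `Tier`, `StepJudgment`, `task_passed`, and `agreement_rate` are hypothetical, the pass rule (no violated negative constraint, every required guideline satisfied at some step) is an assumption, and the abstract does not say how its 87.37% agreement figure was computed.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence, Set


class Tier(Enum):
    """The three guideline tiers named in the abstract."""
    OPERATIONAL_OBLIGATION = "operational_obligation"
    OUTPUT_QUALITY = "output_quality"
    NEGATIVE_CONSTRAINT = "negative_constraint"


@dataclass(frozen=True)
class StepJudgment:
    """One step-level judgment (hypothetical structure)."""
    step: int        # index of the trajectory step that was judged
    tier: Tier       # which guideline tier the judgment belongs to
    guideline: str   # hypothetical guideline identifier
    satisfied: bool  # for a negative constraint, True means "not violated"


def task_passed(judgments: Sequence[StepJudgment], required: Set[str]) -> bool:
    """Assumed rule: fail on any violated negative constraint, otherwise
    pass only if every required guideline was satisfied at some step."""
    violated = any(
        j.tier is Tier.NEGATIVE_CONSTRAINT and not j.satisfied for j in judgments
    )
    met = {
        j.guideline
        for j in judgments
        if j.satisfied and j.tier is not Tier.NEGATIVE_CONSTRAINT
    }
    return not violated and required <= met


def agreement_rate(auto: Sequence[bool], human: Sequence[bool]) -> float:
    """Fraction of tasks where the automatic and human verdicts match."""
    if len(auto) != len(human) or not auto:
        raise ValueError("verdict lists must be non-empty and of equal length")
    return sum(a == h for a, h in zip(auto, human)) / len(auto)
```

Under these assumptions, a trajectory that meets every required obligation and quality guideline still fails if a single negative constraint is violated at any step. That mirrors the idea of checking what the agent must not do alongside what it must do.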
Problem

Research questions and friction points this paper is trying to address.

mobile GUI agents
closed-source applications
automatic evaluation
benchmark
real-world tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

GRADE
verifiable evaluation
closed-source applications
mobile GUI agents
external guidelines
Yifan Sui
Shanghai Jiao Tong University
serverless computing, cloud computing, machine learning system
Xin Huang
StepFun
Hongbing Li
Beijing University of Posts and Telecommunications
Fang Xu
Wuhan University
Image Processing
Jiahe Lv
StepFun
Haolong Yan
Beijing University of Posts and Telecommunications
Yeqing Shen
StepFun
Litao Liu
StepFun
Zhimin Fan
StepFun
Ziyang Meng
StepFun
Jia Wang
StepFun
Junbo Qi
StepFun
Kaijun Tan
StepFun
Zheng Ge
Senior Researcher, StepFun
Multimodal Models Perception and Reasoning
Xiangyu Zhang
Co-founder & Chief Scientist of StepFun
Neural Network Architectures, Efficient Deep Learning, Computer Vision
Daxin Jiang
Co-Founder & CEO, StepFun Corporation
Deep Learning, Foundation Models
Osamu Yoshie
Waseda University