AmbiBench: Benchmarking Mobile GUI Agents Beyond One-Shot Instructions in the Wild

📅 2026-02-12
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses a critical limitation in existing benchmarks for mobile GUI agents, which typically assume fully specified and unambiguous instructions and thereby fail to capture the real-world challenge of handling vague directives that require interactive clarification. To bridge this gap, the authors propose AmbiBench, the first evaluation benchmark to support multi-level instruction clarity, grounded in cognitive gap theory and structured into four distinct clarity categories. They introduce a novel evaluation paradigm centered on bidirectional intent alignment and develop MUSE, an automated multi-agent evaluation framework driven by multimodal large language models and subjected to a rigorous validation protocol. The benchmark comprises a high-ecological-validity dataset of 240 tasks. Experiments reveal the performance boundaries of state-of-the-art agents across clarity levels, quantify the benefits of proactive interaction, and demonstrate strong agreement between MUSE scores and human judgments.
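
To make the taxonomy concrete, here is a minimal Python sketch of how a benchmark task might be tagged with one of the four clarity levels named in the abstract below (Detailed, Standard, Incomplete, Ambiguous). Only the level names come from the paper; the record fields and the example instructions are invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum

class ClarityLevel(Enum):
    """Four instruction-clarity levels, as named in the paper."""
    DETAILED = "detailed"      # all task parameters stated up front
    STANDARD = "standard"      # typical phrasing, intent still recoverable
    INCOMPLETE = "incomplete"  # key parameters missing; the agent must ask
    AMBIGUOUS = "ambiguous"    # the underlying intent itself is unclear

@dataclass
class BenchTask:
    """Hypothetical shape of one benchmark task record."""
    app: str
    instruction: str
    clarity: ClarityLevel

# Invented example instructions for a single app, one per level of vagueness.
tasks = [
    BenchTask("Clock", "Set a 7:30 am weekday alarm labeled 'Gym'", ClarityLevel.DETAILED),
    BenchTask("Clock", "Set a weekday morning alarm", ClarityLevel.STANDARD),
    BenchTask("Clock", "Set an alarm for tomorrow", ClarityLevel.INCOMPLETE),
    BenchTask("Clock", "Help me stop oversleeping", ClarityLevel.AMBIGUOUS),
]
```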

📝 Abstract
Benchmarks are paramount for gauging progress in the domain of Mobile GUI Agents. In practical scenarios, users frequently fail to articulate precise directives containing full task details at the outset, and their expressions are typically ambiguous. Consequently, agents are required to converge on the user's true intent via active clarification and interaction during execution. However, existing benchmarks predominantly operate under the idealized assumption that user-issued instructions are complete and unequivocal. This paradigm focuses exclusively on assessing single-turn execution while overlooking the alignment capability of the agent. To address this limitation, we introduce AmbiBench, the first benchmark incorporating a taxonomy of instruction clarity to shift evaluation from unidirectional instruction following to bidirectional intent alignment. Grounded in Cognitive Gap theory, we propose a taxonomy of four clarity levels: Detailed, Standard, Incomplete, and Ambiguous. We construct a rigorous dataset of 240 ecologically valid tasks across 25 applications, subject to strict review protocols. Furthermore, targeting evaluation in dynamic environments, we develop MUSE (Mobile User Satisfaction Evaluator), an automated framework utilizing an MLLM-as-a-judge multi-agent architecture. MUSE performs fine-grained auditing across three dimensions: Outcome Effectiveness, Execution Quality, and Interaction Quality. Empirical results on AmbiBench reveal the performance boundaries of SoTA agents across different clarity levels, quantify the gains derived from active interaction, and validate the strong correlation between MUSE and human judgment. This work redefines evaluation standards, laying the foundation for next-generation agents capable of truly understanding user intent.
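
The abstract names MUSE's three audit dimensions but does not specify how per-dimension judgments are scaled or combined. The sketch below shows one plausible aggregation; the [0, 1] scale and the weights are assumptions for illustration, not the paper's actual scoring protocol, and in the real framework an MLLM judge would produce the per-dimension scores from execution traces rather than taking them by hand.

```python
from dataclasses import dataclass

@dataclass
class MuseScores:
    """Per-episode scores on MUSE's three audit dimensions.
    The dimension names come from the paper; the [0, 1] scale and the
    weighting below are assumptions made for this sketch."""
    outcome_effectiveness: float
    execution_quality: float
    interaction_quality: float

def aggregate(s: MuseScores, weights=(0.5, 0.25, 0.25)) -> float:
    """Weighted mean over the three dimensions (hypothetical weights)."""
    dims = (s.outcome_effectiveness, s.execution_quality, s.interaction_quality)
    return sum(w * d for w, d in zip(weights, dims))

print(aggregate(MuseScores(0.9, 0.8, 0.6)))  # ≈ 0.8
```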
Problem

Research questions and friction points this paper is trying to address.

Mobile GUI Agents
Ambiguous Instructions
Intent Alignment
Benchmarking
User Interaction
Innovation

Methods, ideas, or system contributions that make the work stand out.

AmbiBench
instruction clarity taxonomy
intent alignment
MUSE
mobile GUI agents
👥 Authors
Jiazheng Sun
Fudan University, China
Mingxuan Li
Fudan University, China
Yingying Zhang
East China Normal University
Subgroup Analysis, Quantile Regression, Tensor Learning, Reinforcement Learning
Jiayang Niu
Fudan University, China
Yachen Wu
Fudan University, China
Ruihan Jin
Fudan University, China
Shuyu Lei
Fudan University, China
Pengrongrui Tan
Fudan University, China
Zongyu Zhang
Zhejiang University
Signal Processing, Performance Analysis
Ruoyi Wang
Fudan University, China
Jiachen Yang
Fudan University, China
Boyu Yang
Fudan University, China
Jiacheng Liu
Fudan University, China
Xin Peng
East China University of Science and Technology
Artificial Intelligence, Machine Learning, Complex Process Modeling