Beyond Pass or Fail: A Multi-dimensional Benchmark for Mobile UI Navigation

📅 2025-01-06
🤖 AI Summary
Existing UI navigation evaluation methods focus solely on task success or failure, offering no fine-grained, automated assessment of the underlying sub-processes, such as goal comprehension, knowledge-based planning, visual grounding, and instruction following, and they suffer from limited dataset and tooling robustness. This paper introduces Sphinx, the first multi-dimensional, automated benchmark for mobile UI navigation. Sphinx integrates invariant-based validation, knowledge probing, vision-language alignment evaluation, instruction-following quantification, and multimodal behavioral analysis. Its evaluation suite enables fully automated, reproducible, cross-application testing. Experiments on eight large language and multimodal models under 13 configurations reveal that no model achieves end-to-end navigation success. Sphinx systematically exposes structural deficiencies across all core sub-capabilities, providing granular, actionable insights into current model limitations.

📝 Abstract
Navigating mobile User Interface (UI) applications with large language and vision models driven by high-level goal instructions is emerging as an important research field with significant practical implications, such as digital assistants and automated UI testing. To evaluate the effectiveness of existing models in mobile UI navigation, benchmarks are required and widely used in the literature. Although multiple benchmarks have recently been established for evaluating functional correctness judged as pass or fail, they fail to address the need for multi-dimensional evaluation of the entire UI navigation process. Furthermore, other existing related datasets lack an automated and robust benchmarking suite, making the evaluation process labor-intensive and error-prone. To address these issues, in this paper we propose a new benchmark named Sphinx for multi-dimensional evaluation of existing models in practical UI navigation. Sphinx provides a fully automated benchmarking suite that enables reproducibility across real-world mobile apps and employs reliable evaluators to assess model progress. In addition to functional correctness, Sphinx includes comprehensive toolkits for multi-dimensional evaluation, such as invariant-based verification, knowledge probing, and knowledge-augmented generation, to evaluate model capabilities including goal understanding, knowledge and planning, grounding, and instruction following, ensuring a thorough assessment of each sub-process in mobile UI navigation. We benchmark 8 large language and multi-modal models with 13 different configurations on Sphinx. Evaluation results show that all these models struggle on Sphinx and fail on all test generation tasks. Our further analysis of the multi-dimensional evaluation results underscores the current progress and highlights future research directions to improve a model's effectiveness for mobile UI navigation.
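The invariant-based verification mentioned in the abstract can be pictured with a minimal sketch: app-level invariants (e.g., "the displayed cart total equals the sum of item prices") are checked after every step of a navigation trace, flagging violations without any human judgment. All names here (`UIState`, the invariants) are illustrative assumptions, not Sphinx's actual API.

```python
# Minimal sketch of invariant-based verification over a UI navigation trace.
# All names are hypothetical illustrations, not Sphinx's real implementation.

from dataclasses import dataclass, field

@dataclass
class UIState:
    """A simplified snapshot of app state after one navigation step."""
    screen: str
    cart_items: list = field(default_factory=list)  # item prices
    cart_total: float = 0.0

def cart_total_invariant(state: UIState) -> bool:
    # The displayed total must equal the sum of item prices.
    return abs(state.cart_total - sum(state.cart_items)) < 1e-9

def checkout_invariant(state: UIState) -> bool:
    # The checkout screen should never be reachable with an empty cart.
    return not (state.screen == "checkout" and not state.cart_items)

INVARIANTS = [cart_total_invariant, checkout_invariant]

def verify_trace(trace: list) -> list:
    """Return (step index, invariant name) for every violation in the trace."""
    violations = []
    for i, state in enumerate(trace):
        for inv in INVARIANTS:
            if not inv(state):
                violations.append((i, inv.__name__))
    return violations

trace = [
    UIState("home"),
    UIState("cart", cart_items=[3.0, 2.0], cart_total=5.0),
    UIState("cart", cart_items=[3.0, 2.0], cart_total=6.0),  # buggy step
]
print(verify_trace(trace))  # [(2, 'cart_total_invariant')]
```

Because each invariant is a pure predicate over observed state, this style of check is fully automated and reproducible across apps, which is the property the benchmark's evaluation suite relies on.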
Problem

Research questions and friction points this paper is trying to address.

Evaluation Standards
Large-scale Model Assessment
Automated Testing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sphinx Evaluation Standard
Multimodal Model Assessment
Mobile UI Navigation
Dezhi Ran
School of Computer Science, Peking University
Short Video Streaming · Software Testing · Program Analysis
Mengzhou Wu
Peking University
Software Engineering · Large Language Model
Hao Yu
School of Software and Microelectronics, Peking University, Beijing, China
Yuetong Li
The University of Chicago, Chicago, USA
Jun Ren
University of Texas at Dallas, Dallas, USA
Yuan Cao
School of EECS, Peking University, Beijing, China
Xia Zeng
Tencent Inc., Shenzhen, China
Haochuan Lu
Tencent Inc., Shenzhen, China
Zexin Xu
University of Texas at Dallas, Dallas, USA
Mengqian Xu
East China Normal University, Shanghai, China
Ting Su
East China Normal University, Shanghai, China
Liangchao Yao
Tencent Inc., Shenzhen, China
Ting Xiong
Tencent Inc., Shenzhen, China
Wei Yang
University of Texas at Dallas, Dallas, USA
Yuetang Deng
Tencent Inc., Shenzhen, China
Assaf Marron
Weizmann Institute of Science
Software Engineering · Formal Methods · Computer Science · Programming · Biological Modeling
David Harel
Professor of Computer Science, The Weizmann Institute
Computer Science · Systems Biology
Tao Xie
Key Lab of HCST (PKU), MOE; SCS, Peking University, Beijing, China