MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents

📅 2025-07-25
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This paper addresses key challenges in evaluating GUI automation agents across Windows, macOS, Linux, iOS, Android, and Web platforms, namely the difficulty of assessing performance, low task efficiency, and excessive redundant actions. The authors propose the first hierarchical evaluation framework spanning four capability levels: interface understanding, element grounding, task automation, and task collaboration. They also introduce the Efficiency-Quality Area (EQA) metric to quantify execution efficiency in online automation. Empirical analysis identifies precise visual grounding as the primary bottleneck for task success and shows that modular architectures and early-stopping strategies are critical for efficiency gains. Reliable cross-platform automation further requires strong task planning together with long-context memory, a broad action space, and long-horizon reasoning. Experiments demonstrate that coupling strong planning capabilities with a dedicated grounding module significantly reduces redundant steps and improves success rates, establishing a systematic, empirically grounded evaluation paradigm for scalable, high-performance GUI agents.
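
The Efficiency-Quality Area (EQA) metric is named above but its formula is not given on this page. Purely as an illustration of the idea of scoring task quality jointly with step efficiency, the sketch below assumes EQA is the normalized area under a success-versus-step-budget curve, so agents that finish tasks in fewer steps accumulate more area; the function name, signature, and formula are assumptions, not the paper's definition.

```python
# Illustrative sketch only: the exact EQA definition is not given on this page.
# Assumption: EQA is the normalized area under a success-vs-step-budget curve,
# so an agent that completes tasks in fewer steps accumulates more area.

def eqa_score(step_counts, successes, max_steps):
    """step_counts[i]: steps the agent used on task i.
    successes[i]: whether task i was ultimately completed.
    max_steps: the step budget over which the area is normalized."""
    total_area = 0.0
    for steps, ok in zip(step_counts, successes):
        if not ok:
            continue  # a failed task contributes no area at any budget
        # The task counts as solved for every budget b with steps <= b <= max_steps.
        total_area += (max_steps - steps + 1) / max_steps
    return total_area / len(step_counts)

# Example: two tasks solved quickly, one solved late, one failed.
print(eqa_score([5, 8, 45, 50], [True, True, True, False], max_steps=50))  # 0.475
```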

📝 Abstract
We introduce MMBench-GUI, a hierarchical benchmark for evaluating GUI automation agents across Windows, macOS, Linux, iOS, Android, and Web platforms. It comprises four levels: GUI Content Understanding, Element Grounding, Task Automation, and Task Collaboration, covering essential skills for GUI agents. In addition, we propose a novel Efficiency-Quality Area (EQA) metric to assess GUI agent execution efficiency in online automation scenarios. Through MMBench-GUI, we identify accurate visual grounding as a critical determinant of overall task success, emphasizing the substantial benefits of modular frameworks that integrate specialized grounding modules. Furthermore, to achieve reliable GUI automation, an agent requires strong task planning and cross-platform generalization abilities, with long-context memory, a broad action space, and long-term reasoning playing a critical role. More importantly, task efficiency remains a critically underexplored dimension, and all models suffer from substantial inefficiencies, with excessive redundant steps even when tasks are ultimately completed. The integration of precise localization, effective planning, and early stopping strategies is indispensable to enable truly efficient and scalable GUI automation. Our benchmark code, evaluation data, and running environment will be publicly available at https://github.com/open-compass/MMBench-GUI.
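
The abstract names four capability levels and six platforms but does not describe how evaluation items are organized internally. The snippet below is a minimal, hypothetical sketch of such a hierarchical manifest; the level names follow the abstract, while the dataclass, field names, and example entry are invented for illustration and are not the benchmark's actual schema (see the linked repository for the real format).

```python
from dataclasses import dataclass

# Level names follow the abstract; everything else (fields, example task) is hypothetical.
LEVELS = (
    "GUI Content Understanding",
    "Element Grounding",
    "Task Automation",
    "Task Collaboration",
)
PLATFORMS = ("Windows", "macOS", "Linux", "iOS", "Android", "Web")

@dataclass
class EvalItem:
    level: str        # one of LEVELS
    platform: str     # one of PLATFORMS
    instruction: str  # natural-language task or question
    screenshot: str   # path to the GUI screenshot (offline levels)
    answer: object    # ground truth: text, a bounding box, or an action trace

# A hypothetical grounding item: locate an element given a screenshot.
item = EvalItem(
    level="Element Grounding",
    platform="Windows",
    instruction="Click the 'Save As' menu entry.",
    screenshot="screenshots/win_notepad_0001.png",
    answer=(512, 96, 640, 128),  # assumed (x1, y1, x2, y2) box, illustration only
)
assert item.level in LEVELS and item.platform in PLATFORMS
```
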
Problem

Research questions and friction points this paper is trying to address.

Evaluating GUI automation agents across multiple platforms
Assessing GUI agent efficiency with EQA metric
Improving task success via precise visual grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical multi-platform GUI evaluation framework
Efficiency-Quality Area metric for automation assessment
Modular frameworks with precise grounding integration
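
The modular-framework point reports that coupling a strong planner with a dedicated grounding module cuts redundant steps, but the page does not specify how the modules are wired. The loop below is a hedged sketch of one such arrangement: a planner proposes a semantic action, a separate grounder resolves it to screen coordinates, and an early-stopping check ends the episode once the planner declares the task done. The planner, grounder, and env objects and all method names are placeholders, not the paper's API.

```python
# Hedged sketch of a modular GUI agent loop: planner -> grounder -> executor,
# with an early-stopping check. All names are placeholders, not the paper's API.

def run_episode(task, planner, grounder, env, max_steps=30):
    history = []  # long-context memory of (plan, observation) pairs
    obs = env.reset(task)
    for step in range(max_steps):
        plan = planner.next_action(task, obs, history)  # e.g. {"op": "click", "target": "Save As"}
        if plan["op"] == "stop":                        # early stopping: planner says task is done
            return {"done": True, "steps": step}
        if plan["op"] in ("click", "type"):
            # Dedicated grounding module maps the described element to coordinates.
            plan["xy"] = grounder.locate(obs.screenshot, plan["target"])
        obs = env.execute(plan)                         # act on the real GUI, get a new screenshot
        history.append((plan, obs))
    return {"done": False, "steps": max_steps}          # budget exhausted without completion
```
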
🔎 Similar Papers
No similar papers found.
Authors
Xuehui Wang, PhD Candidate, Shanghai Jiao Tong University (Computer Vision, Segmentation, Detection)
Zhenyu Wu, Shanghai Jiao Tong University, Shanghai AI Laboratory
JingJing Xie, Xiamen University, Shanghai AI Laboratory
Zichen Ding, Shanghai AI Laboratory (Computer-Use Agent, AI Agents, Large Language Models)
Bowen Yang, University of Science and Technology of China, Shanghai AI Laboratory
Zehao Li, Peking University (Operations research, Stochastic approximation)
Zhaoyang Liu, Tongyi Lab, Alibaba Group (LLM, Recommendation)
Qingyun Li, University of Electronic Science and Technology of China (wireless communications, information theory)
Xuan Dong, Associate Professor, Beijing University of Posts and Telecommunications (Computer Vision)
Zhe Chen, Nanjing University, Shanghai AI Laboratory
Weiyun Wang, Shanghai AI Laboratory; Fudan University (Vision-Language Model, MLLM, Foundation Model)
Xiangyu Zhao, Shanghai Jiao Tong University, Shanghai AI Laboratory
Jixuan Chen, UC San Diego (Multimodal agents, Natural language processing, Machine learning)
Haodong Duan, Shanghai AI Lab | CUHK | PKU (Computer Vision, Video Understanding, Multimodal Learning, Generative AI)
Tianbao Xie, University of Hong Kong (Artificial Intelligence, Deep Learning, Natural Language Processing)
Chenyu Yang, Shanghai AI Laboratory
Shiqian Su, PhD student, Tsinghua University (Large Language Model, Embodied Intelligence, Multimodal models)
Yue Yu, Tsinghua University
Yuan Huang
Yiqian Liu
Xiao Zhang
Yanting Zhang, Donghua University
Xiangyu Yue, The Chinese University of Hong Kong / UC Berkeley / Stanford University / NJU (Artificial Intelligence, Computer Vision, Multi-modal Learning)
Weijie Su, Associate Professor, University of Pennsylvania (Machine Learning, Differential Privacy, High-Dimensional Statistics, Optimization, Deep Learning)
Xizhou Zhu, Tsinghua University