VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding Tasks

📅 2025-12-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing GUI grounding benchmarks suffer from limited scale, platform specificity, and narrow domain coverage. To address these limitations, we propose VenusBench-GD, the first large-scale, multi-platform, bilingual GUI grounding benchmark, encompassing both mobile and desktop interfaces and supporting hierarchical evaluation from fundamental to advanced grounding tasks. We introduce a taxonomy of six complementary subtasks, design a high-precision cross-platform UI collection and alignment pipeline, and provide fine-grained element annotation, bilingual semantic alignment, and a robustness-aware evaluation framework. Experimental results show that general-purpose multimodal models match or surpass specialized GUI models on basic tasks, while specialized models exhibit overfitting and poor generalization on advanced tasks. VenusBench-GD establishes a more comprehensive and challenging evaluation standard for GUI grounding research, enabling rigorous assessment across platforms, languages, and task complexities.

📝 Abstract
GUI grounding is a critical component in building capable GUI agents. However, existing grounding benchmarks suffer from significant limitations: they either provide insufficient data volume and narrow domain coverage, or focus excessively on a single platform and require highly specialized domain knowledge. In this work, we present VenusBench-GD, a comprehensive, bilingual benchmark for GUI grounding that spans multiple platforms, enabling hierarchical evaluation for real-world applications. VenusBench-GD contributes as follows: (i) we introduce a large-scale, cross-platform benchmark with extensive coverage of applications, diverse UI elements, and rich annotated data, (ii) we establish a high-quality data construction pipeline for grounding tasks, achieving higher annotation accuracy than existing benchmarks, and (iii) we extend the scope of element grounding by proposing a hierarchical task taxonomy that divides grounding into basic and advanced categories, encompassing six distinct subtasks designed to evaluate models from complementary perspectives. Our experimental findings reveal critical insights: general-purpose multimodal models now match or even surpass specialized GUI models on basic grounding tasks. In contrast, advanced tasks still favor GUI-specialized models, though these models exhibit significant overfitting and poor robustness. These results underscore the necessity of comprehensive, multi-tiered evaluation frameworks.
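GUI grounding benchmarks of this kind are commonly scored by checking whether a model's predicted click point lands inside the target element's annotated bounding box. The sketch below illustrates that standard point-in-box accuracy metric; the function names and coordinate convention (`x1, y1, x2, y2`) are illustrative assumptions, not the paper's exact evaluation protocol.

```python
def grounding_hit(pred_point, target_bbox):
    """True if the predicted click point falls inside the target
    element's bounding box, given as (x1, y1, x2, y2)."""
    x, y = pred_point
    x1, y1, x2, y2 = target_bbox
    return x1 <= x <= x2 and y1 <= y <= y2

def grounding_accuracy(predictions, bboxes):
    """Fraction of examples where the predicted point hits its box."""
    hits = sum(grounding_hit(p, b) for p, b in zip(predictions, bboxes))
    return hits / len(predictions)

# Toy example: two of three predicted points land inside their boxes.
preds = [(50, 40), (10, 10), (200, 120)]
boxes = [(30, 20, 80, 60), (100, 100, 150, 140), (180, 100, 240, 160)]
print(grounding_accuracy(preds, boxes))  # 2/3
```

A robustness-aware evaluation, as the abstract describes, would additionally re-score the same predictions under perturbed screenshots (resolution changes, theme shifts), which this minimal metric supports unchanged.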
Problem

Research questions and friction points this paper is trying to address.

Addresses limitations in existing GUI grounding benchmarks
Introduces a comprehensive multi-platform bilingual GUI benchmark
Proposes hierarchical task taxonomy for evaluating grounding models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-platform benchmark with extensive application coverage
High-accuracy data construction pipeline for grounding tasks
Hierarchical task taxonomy dividing grounding into basic and advanced categories
Beitong Zhou
Huazhong University of Science and Technology
Zhexiao Huang
Venus Team, Ant Group
Yuan Guo
Venus Team, Ant Group
Zhangxuan Gu
Ant Group
Tianyu Xia
Ant Group
Zichen Luo
Venus Team, Ant Group
Fei Tang
Venus Team, Ant Group
Dehan Kong
iMean AI
Yanyi Shang
iMean AI
Suling Ou
Venus Team, Ant Group
Zhenlin Guo
Venus Team, Ant Group
Changhua Meng
Venus Team, Ant Group
Shuheng Shen
Ant Group