GUI Knowledge Bench: Revealing the Knowledge Gap Behind VLM Failures in GUI Tasks

πŸ“… 2025-10-29
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Current large vision-language models (VLMs) significantly underperform humans in GUI task automation, primarily due to a lack of systematic GUI knowledge. Method: the paper distills GUI knowledge into a three-dimensional framework (*Interface Perception*, *Interaction Prediction*, and *Instruction Understanding*) and introduces GUI Knowledge Bench, a cross-platform, multi-application benchmark covering six major platforms and 292 applications. It employs a failure-mode-driven evaluation paradigm built from multiple-choice and yes/no questions. Contribution/Results: experiments reveal that while VLMs can identify widget functionality, they exhibit substantial deficiencies in system-state awareness, interaction-behavior prediction, and task-completion judgment, and model performance correlates strongly with mastery of the three knowledge dimensions. GUI Knowledge Bench provides a standardized tool for evaluating GUI knowledge in VLMs, supporting model selection and knowledge-enhanced training.

πŸ“ Abstract
Large vision-language models (VLMs) have advanced graphical user interface (GUI) task automation but still lag behind humans. We hypothesize this gap stems from missing core GUI knowledge, which existing training schemes (such as supervised fine-tuning and reinforcement learning) alone cannot fully address. By analyzing common failure patterns in GUI task execution, we distill GUI knowledge into three dimensions: (1) interface perception, knowledge about recognizing widgets and system states; (2) interaction prediction, knowledge about reasoning over action-state transitions; and (3) instruction understanding, knowledge about planning, verifying, and assessing task-completion progress. We further introduce GUI Knowledge Bench, a benchmark with multiple-choice and yes/no questions spanning six platforms (Web, Android, macOS, Windows, Linux, iOS) and 292 applications. Our evaluation shows that current VLMs identify widget functions but struggle with perceiving system states, predicting actions, and verifying task completion. Experiments on real-world GUI tasks further validate the close link between GUI knowledge and task success. By providing a structured framework for assessing GUI knowledge, our work supports the selection of VLMs with greater potential prior to downstream training and offers insights for building more capable GUI agents.
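The benchmark structure described above (question items tagged by knowledge dimension and platform, answered as multiple-choice or yes/no selections) can be sketched in a few lines. This is a hypothetical illustration: the `BenchItem` schema, field names, and `per_dimension_accuracy` helper are assumptions for exposition, not the benchmark's actual data format or scoring code.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class BenchItem:
    # Hypothetical item schema, not the benchmark's real format.
    dimension: str        # e.g. "interface_perception", "interaction_prediction",
                          # or "instruction_understanding"
    platform: str         # e.g. "Web", "Android", "macOS"
    question: str
    choices: list = field(default_factory=list)  # two entries for yes/no items
    answer: int = 0       # index of the correct choice

def per_dimension_accuracy(items, predictions):
    """Aggregate a model's chosen-answer indices into accuracy per dimension."""
    correct, total = defaultdict(int), defaultdict(int)
    for item, pred in zip(items, predictions):
        total[item.dimension] += 1
        if pred == item.answer:
            correct[item.dimension] += 1
    return {dim: correct[dim] / total[dim] for dim in total}

# Toy items and one model's predicted answer indices.
items = [
    BenchItem("interface_perception", "Web",
              "Is the 'Dark mode' toggle currently enabled?", ["yes", "no"], 0),
    BenchItem("interaction_prediction", "Android",
              "What happens after tapping 'Save'?",
              ["dialog closes", "app exits", "nothing", "file is deleted"], 0),
]
print(per_dimension_accuracy(items, [0, 1]))
```

Scoring per dimension rather than overall is what lets the benchmark localize failures, e.g. a model that recognizes widgets but mispredicts action-state transitions.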
Problem

Research questions and friction points this paper is trying to address.

Identifying missing GUI knowledge causing VLM performance gaps
Analyzing failure patterns in widget recognition and state transitions
Assessing VLMs' limitations in task verification and completion planning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Defines three GUI knowledge dimensions for VLMs
Introduces GUI Knowledge Bench benchmark across platforms
Links GUI knowledge assessment to real task success
Chenrui Shi
Beijing Institute of Technology
Zedong Yu
State Key Laboratory of General Artificial Intelligence, BIGAI
Zhi Gao
Beijing Institute of Technology
Ruining Feng
Tsinghua University
Enqi Liu
Beijing Institute of Technology
Yuwei Wu
Ph.D. candidate, GRASP Lab, University of Pennsylvania
Yunde Jia
Shenzhen MSU-BIT University
Liuyu Xiang
Beijing University of Posts and Telecommunications
Zhaofeng He
Beijing University of Posts and Telecommunications
Qing Li
State Key Laboratory of General Artificial Intelligence, BIGAI