GUI-World: A Video Benchmark and Dataset for Multimodal GUI-oriented Understanding

📅 2024-06-16
📈 Citations: 20 (3 influential)
🤖 AI Summary
Current multimodal large language models (MLLMs) exhibit significant deficiencies in dynamic GUI understanding, particularly for multi-step interactions, multi-window coordination, and dynamically rendered web content. To address this gap, we introduce GUI-World, the first benchmark dedicated to dynamic GUI video understanding, covering six GUI scenarios, eight question types, and three answer formats. We conduct the first systematic evaluation of MLLMs on GUI video comprehension and propose GUI-Vid, the first Video LLM fine-tuned specifically for GUI tasks, revealing the critical roles of keyframes and interaction history. Leveraging human-in-the-loop annotation, we construct a high-quality dataset and design a multi-granularity evaluation framework. Experiments show that mainstream Video LLMs underperform substantially on GUI videos, given the scarcity of GUI video training data; GUI-Vid achieves significant gains, yet the limitations of underlying base models still keep end-to-end GUI agents impractical.

📝 Abstract
Recently, Multimodal Large Language Models (MLLMs) have been used as agents that control keyboard and mouse inputs by directly perceiving the Graphical User Interface (GUI) and generating corresponding commands. However, current agents primarily demonstrate strong understanding capabilities in static environments and are mainly applied to relatively simple domains, such as Web or mobile interfaces. We argue that a robust GUI agent should be capable of perceiving temporal information on the GUI, including dynamic Web content and multi-step tasks. It should also possess a comprehensive understanding of various GUI scenarios, including desktop software and multi-window interactions. To this end, this paper introduces a new dataset, termed GUI-World, which features meticulously crafted Human-MLLM annotations and extensively covers six GUI scenarios and eight types of GUI-oriented questions in three formats. We evaluate the capabilities of current state-of-the-art MLLMs, including Image LLMs and Video LLMs, in understanding various types of GUI content, especially dynamic and sequential content. Our findings reveal that current models struggle with dynamic GUI content unless given manually annotated keyframes or operation histories. Moreover, Video LLMs fall short on all GUI-oriented tasks, given the scarcity of GUI video training data. We therefore take an initial step toward leveraging a fine-tuned Video LLM, GUI-Vid, as a GUI-oriented assistant, demonstrating improved understanding of various GUI tasks. However, due to the limited performance of base LLMs, we conclude that using Video LLMs as GUI agents remains a significant challenge. We believe our work provides valuable insights for future research in dynamic GUI content understanding. The dataset and code are publicly available at: https://gui-world.github.io.
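
The abstract's finding that models handle dynamic GUI content only when given keyframes or operation history implies a keyframe-sampling preprocessing step before the model is queried. Below is a minimal sketch, not the authors' released code, of how one might uniformly sample keyframes from a GUI screen recording; the file name, frame budget, and downstream model call are illustrative assumptions.

```python
# Minimal sketch (not the paper's code): uniformly sample keyframes from a
# GUI screen recording before querying a Video LLM. The file name, frame
# budget, and downstream model call are hypothetical.
import cv2  # pip install opencv-python

def sample_keyframes(video_path: str, num_frames: int = 8):
    """Uniformly sample RGB frames from a GUI screen recording."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_frames):
        # Seek to an evenly spaced frame index before reading.
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * total / num_frames))
        ok, frame = cap.read()
        if ok:
            # OpenCV decodes BGR; convert to RGB for vision-language models.
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames

# The sampled frames, together with a GUI-oriented question, would then be
# passed to an Image/Video LLM and graded in one of the benchmark's three
# question formats.
frames = sample_keyframes("gui_recording.mp4", num_frames=8)
```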
Problem

Research questions and friction points this paper is trying to address.

Evaluating MLLMs' ability to understand dynamic GUI content.
Addressing the limitations of current agents on multi-step GUI tasks.
Introducing the GUI-World dataset for analyzing diverse GUI scenarios.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the GUI-World dataset with Human-MLLM annotations
Evaluates MLLMs on dynamic and sequential GUI content
Proposes GUI-Vid, a fine-tuned Video LLM, as a GUI-oriented assistant
Dongping Chen (Huazhong University of Science and Technology)
Yue Huang (University of Notre Dame)
Siyuan Wu (Huazhong University of Science and Technology)
Jingyu Tang (Huazhong University of Science and Technology)
Liuyi Chen (Huazhong University of Science and Technology)
Yilin Bai (Huazhong University of Science and Technology)
Zhigang He (Huazhong University of Science and Technology)
Chenlong Wang (Huazhong University of Science and Technology)
Huichi Zhou (University College London)
Yiqiang Li (Huazhong University of Science and Technology)
Tianshuo Zhou (Huazhong University of Science and Technology)
Yue Yu (Huazhong University of Science and Technology)
Chujie Gao (iSURE visiting student, University of Notre Dame)
Qihui Zhang (Peking University)
Yi Gui (Huazhong University of Science and Technology)
Zhen Li (Huazhong University of Science and Technology)
Yao Wan (Huazhong University of Science and Technology)
Pan Zhou (Huazhong University of Science and Technology)
Jianfeng Gao (Microsoft Research)
Lichao Sun (Lehigh University)