Improved GUI Grounding via Iterative Narrowing

📅 2024-11-18

🏛️ arXiv.org

📈 Citations: 3

✨ Influential: 0

🤖 AI Summary

To address the low localization accuracy and poor cross-platform robustness of general-purpose vision-language models (e.g., GPT-4V) on zero-shot GUI grounding tasks, this paper proposes a parameter-free, visual-prompt-driven iterative refinement mechanism. The method uses lightweight visual prompting over multiple rounds to progressively narrow the candidate region, which improves fine-grained UI element localization. Evaluated on a unified, multi-platform benchmark covering Web, Android, and iOS interfaces, the approach improves GPT-4V’s GUI grounding accuracy by an average of 12.7%. It is compatible with both general and fine-tuned vision-language models. To support reproducibility and further research, the authors open-source their code, evaluation framework, and multi-platform annotated datasets.

📝 Abstract

Graphical User Interface (GUI) grounding plays a crucial role in enhancing the capabilities of Vision-Language Model (VLM) agents. While general VLMs, such as GPT-4V, demonstrate strong performance across various tasks, their proficiency in GUI grounding remains suboptimal. Recent studies have focused on fine-tuning these models specifically for zero-shot GUI grounding, yielding significant improvements over baseline performance. We introduce a visual prompting framework that employs an iterative narrowing mechanism to further improve the performance of both general and fine-tuned models in GUI grounding. For evaluation, we tested our method on a comprehensive benchmark comprising various UI platforms and provided the code to reproduce our results.
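The iterative narrowing idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `predict` callback (standing in for a VLM that returns a normalised click point inside the region it is shown), the crop ratio of 0.5, and the default of three rounds are all assumptions chosen for illustration. The sketch only tracks geometry; in practice each region would be cropped from the screenshot and sent to the model.

```python
from typing import Callable, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels
Point = Tuple[float, float]


def crop_around(point: Point, box: Box, scale: float) -> Box:
    """Return a sub-box `scale` times the size of `box`, centred on
    `point` where possible and clamped so it stays inside `box`."""
    left, top, right, bottom = box
    w = (right - left) * scale
    h = (bottom - top) * scale
    new_left = min(max(point[0] - w / 2, left), right - w)
    new_top = min(max(point[1] - h / 2, top), bottom - h)
    return (new_left, new_top, new_left + w, new_top + h)


def iterative_narrowing(
    predict: Callable[[Box], Point],  # hypothetical model wrapper
    screen_size: Tuple[int, int],
    iterations: int = 3,              # assumed value, not from the paper
    scale: float = 0.5,               # assumed value, not from the paper
) -> Point:
    """Repeatedly query `predict` on a shrinking region and return the
    final prediction in full-screen pixel coordinates."""
    box: Box = (0.0, 0.0, float(screen_size[0]), float(screen_size[1]))
    point: Point = (box[2] / 2, box[3] / 2)
    for step in range(iterations):
        # The model answers in [0, 1] coordinates relative to `box`.
        nx, ny = predict(box)
        # Map the relative answer back to full-screen pixels.
        point = (
            box[0] + nx * (box[2] - box[0]),
            box[1] + ny * (box[3] - box[1]),
        )
        if step < iterations - 1:
            # Narrow the search region around the current prediction.
            box = crop_around(point, box, scale)
    return point
```

Because the model sees a smaller region each round, a prediction that is only coarsely accurate in relative terms corresponds to a smaller pixel error after mapping back to the full screen. The clamping in `crop_around` keeps the narrowed region inside the previous one, so predictions near a screen edge do not produce out-of-bounds crops.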

Problem

Research questions and friction points this paper is trying to address.

Enhancing GUI grounding for Vision-Language Models

Improving zero-shot GUI grounding beyond what fine-tuning alone achieves

Introducing iterative narrowing for better GUI performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Iterative narrowing mechanism for GUI grounding

Visual prompting framework enhances VLM performance

Applicable to both general and fine-tuned VLMs for GUI grounding

🔎 Similar Papers

Visual grounding for desktop graphical user interfaces