Improved GUI Grounding via Iterative Narrowing

📅 2024-11-18
🏛️ arXiv.org
📈 Citations: 3
✨ Influential: 0

🤖 AI Summary
To address the low localization accuracy and poor cross-platform robustness of general-purpose vision-language models (e.g., GPT-4V) on zero-shot GUI grounding tasks, this paper proposes a parameter-free, vision-prompt-driven iterative refinement mechanism. The method runs several rounds of visual prompting, each one narrowing the candidate region around the previous prediction, which improves fine-grained UI element localization. Evaluated on a unified, multi-platform benchmark covering Web, Android, and iOS interfaces, the approach improves GPT-4V's GUI grounding accuracy by an average of 12.7%. It is compatible with both generic and fine-tuned vision-language models. To support reproducibility and further research, the authors open-source their code, evaluation framework, and multi-platform annotated datasets.

šŸ“ Abstract
Graphical User Interface (GUI) grounding plays a crucial role in enhancing the capabilities of Vision-Language Model (VLM) agents. While general VLMs, such as GPT-4V, demonstrate strong performance across various tasks, their proficiency in GUI grounding remains suboptimal. Recent studies have focused on fine-tuning these models specifically for zero-shot GUI grounding, yielding significant improvements over baseline performance. We introduce a visual prompting framework that employs an iterative narrowing mechanism to further improve the performance of both general and fine-tuned models in GUI grounding. For evaluation, we tested our method on a comprehensive benchmark comprising various UI platforms and provided the code to reproduce our results.

Problem

Research questions and friction points this paper is trying to address.

Enhancing GUI grounding for Vision-Language Model agents
General VLMs such as GPT-4V remain suboptimal at GUI grounding
Improving zero-shot GUI grounding beyond what fine-tuning alone achieves

Innovation

Methods, ideas, or system contributions that make the work stand out.

Iterative narrowing mechanism for GUI grounding
Visual prompting framework that improves VLM grounding performance
Applicable to both general and fine-tuned models
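
The iterative narrowing idea listed above can be illustrated with a minimal geometric sketch. This is one plausible reading of the mechanism, not the authors' implementation: the crop ratio (`scale`), the number of rounds, and the `predict` callable (a stand-in for a VLM call that returns a point in normalized crop coordinates) are all assumptions made for illustration. The mock model simulates a predictor with limited spatial resolution so the example runs without any external model.

```python
from typing import Callable, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom), in pixels
Point = Tuple[float, float]              # (x, y), in pixels


def crop_around(point: Point, box: Box, scale: float, bounds: Box) -> Box:
    """Shrink `box` by `scale` (0 < scale <= 1), center it on `point`,
    and shift it so it stays inside `bounds`."""
    width = (box[2] - box[0]) * scale
    height = (box[3] - box[1]) * scale
    left = min(max(point[0] - width / 2, bounds[0]), bounds[2] - width)
    top = min(max(point[1] - height / 2, bounds[1]), bounds[3] - height)
    return (left, top, left + width, top + height)


def iterative_narrowing(
    predict: Callable[[Box], Point],   # hypothetical model wrapper
    image_size: Tuple[int, int],
    iterations: int = 3,               # assumed; not taken from the paper
    scale: float = 0.5,                # assumed; not taken from the paper
) -> Point:
    """Query the model, crop around its answer, and repeat.

    `predict(box)` is expected to look at the crop `box` of the screenshot
    and return the target location normalized to [0, 1] within that crop.
    """
    full: Box = (0.0, 0.0, float(image_size[0]), float(image_size[1]))
    box = full
    point: Point = (full[2] / 2, full[3] / 2)
    for _ in range(iterations):
        nx, ny = predict(box)
        # Map the crop-relative prediction back to full-image pixels.
        point = (box[0] + nx * (box[2] - box[0]),
                 box[1] + ny * (box[3] - box[1]))
        # Narrow the search region around the new estimate.
        box = crop_around(point, box, scale, full)
    return point


def make_mock_model(target: Point, grid: int = 10) -> Callable[[Box], Point]:
    """Mock predictor: knows the target but can only answer on a coarse
    grid, mimicking a model with limited localization resolution."""
    def predict(box: Box) -> Point:
        nx = (target[0] - box[0]) / (box[2] - box[0])
        ny = (target[1] - box[1]) / (box[3] - box[1])
        nx = min(max(nx, 0.0), 1.0)
        ny = min(max(ny, 0.0), 1.0)
        return (round(nx * grid) / grid, round(ny * grid) / grid)
    return predict


if __name__ == "__main__":
    model = make_mock_model((873.0, 127.0))
    for n in (1, 2, 3):
        print(n, iterative_narrowing(model, (1000, 1000), iterations=n))
```

In this toy setup the predictor's resolution is fixed relative to the crop, so each smaller crop tightens the bound on the pixel error in full-image coordinates. A real VLM offers no such guarantee: an early prediction that falls far from the target can push the target outside later crops, which is why the crop ratio and iteration count would need tuning in practice.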