AI Summary
To address the low localization accuracy and poor cross-platform robustness of general-purpose vision-language models (e.g., GPT-4V) on zero-shot GUI grounding tasks, this paper proposes a parameter-free, vision-prompt-driven iterative refinement mechanism. Our method employs multi-round spatial attention focusing guided by lightweight visual prompt engineering to dynamically narrow down candidate regions, thereby significantly enhancing fine-grained UI element localization. Evaluated on a unified, multi-platform benchmark covering Web, Android, and iOS interfaces, our approach improves GPT-4V's GUI grounding accuracy by an average of 12.7%. It is compatible with both generic and fine-tuned vision-language models. To foster reproducibility and further research, we fully open-source our code, evaluation framework, and multi-platform annotated datasets.
Abstract
Graphical User Interface (GUI) grounding plays a crucial role in enhancing the capabilities of Vision-Language Model (VLM) agents. While general VLMs, such as GPT-4V, demonstrate strong performance across various tasks, their proficiency in GUI grounding remains suboptimal. Recent studies have focused on fine-tuning these models specifically for zero-shot GUI grounding, yielding significant improvements over baseline performance. We introduce a visual prompting framework that employs an iterative narrowing mechanism to further improve the performance of both general and fine-tuned models in GUI grounding. For evaluation, we tested our method on a comprehensive benchmark comprising various UI platforms and provided the code to reproduce our results.
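The iterative narrowing idea can be illustrated with a minimal sketch. This is not the authors' released implementation: the abstract does not state the crop ratio, the number of rounds, or the prompt format, so `shrink`, `iterations`, `crop_around`, and the `predict` callback are all hypothetical names and values chosen for illustration. Here `predict` stands in for a VLM query that receives the current crop and returns a point in coordinates relative to that crop (0 to 1).

```python
from typing import Callable, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels
Point = Tuple[float, float]              # (x, y)


def crop_around(point: Point, box: Box, shrink: float) -> Box:
    """Return a window `shrink` times the size of `box`, centered on `point`
    and shifted where necessary so that it stays inside `box`."""
    left, top, right, bottom = box
    width = (right - left) * shrink
    height = (bottom - top) * shrink
    x, y = point
    new_left = min(max(x - width / 2, left), right - width)
    new_top = min(max(y - height / 2, top), bottom - height)
    return (new_left, new_top, new_left + width, new_top + height)


def iterative_narrowing(
    predict: Callable[[Box], Point],  # hypothetical stand-in for a VLM call
    image_size: Tuple[int, int],
    iterations: int = 3,              # assumed value, not from the paper
    shrink: float = 0.5,              # assumed value, not from the paper
) -> Point:
    """Refine a predicted location by repeatedly cropping around it."""
    box: Box = (0.0, 0.0, float(image_size[0]), float(image_size[1]))
    point: Point = (box[2] / 2, box[3] / 2)
    for _ in range(iterations):
        # The model answers in coordinates relative to the current crop.
        rel_x, rel_y = predict(box)
        # Map the relative answer back to absolute image coordinates.
        point = (
            box[0] + rel_x * (box[2] - box[0]),
            box[1] + rel_y * (box[3] - box[1]),
        )
        # Narrow the search window around the current estimate.
        box = crop_around(point, box, shrink)
    return point
```

The intuition behind the sketch is that if a model's positional error is roughly proportional to the size of the image it sees, each smaller crop shrinks the absolute error, provided the target stays inside the window. In practice the `predict` callback would crop the screenshot to `box`, send it with the instruction to the model, and parse the returned coordinates.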