🤖 AI Summary
General-purpose vision-language models (VLMs) struggle to directly output precise coordinates for GUI localization tasks, and existing fine-tuning approaches require large-scale annotated data, making them costly and impractical.
Method: We propose a zero-shot auxiliary reasoning framework that activates VLMs’ intrinsic spatial understanding without fine-tuning. It introduces structured spatial priors via *spatial visualization prompting*: overlaying coordinate axes, grid lines, and labeled intersection points onto input GUI images to enable explicit spatial reasoning. The framework comprises three plug-and-play zero-shot strategies.
Contribution/Results: Extensive evaluation across four GUI localization benchmarks with seven state-of-the-art VLMs demonstrates substantial gains in average localization accuracy, strong cross-model generalization, and effective task transferability. To our knowledge, this is the first work to incorporate structured spatial priors as prompts for zero-shot GUI localization, establishing a new paradigm for unlocking VLMs’ spatial capabilities in low-resource settings.
📝 Abstract
Graphical user interface (GUI) grounding is a fundamental task for building GUI agents. However, general vision-language models (VLMs) struggle with this task due to a lack of task-specific optimization. In this paper, we identify a key gap: while VLMs exhibit significant latent grounding potential, as demonstrated by their performance under the Pointing Game evaluation protocol, they underperform when tasked with outputting explicit coordinates. To address this discrepancy, and to bypass the high data and annotation costs of current fine-tuning approaches, we propose three zero-shot auxiliary reasoning methods. By providing explicit spatial cues such as axes, grids, and labeled intersections as part of the input image, these methods enable VLMs to articulate their implicit spatial understanding capabilities. We evaluate these methods on four GUI grounding benchmarks across seven open-source and proprietary VLMs. The evaluation results demonstrate that the proposed methods substantially improve the performance of GUI grounding.
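To make the core idea concrete, here is a minimal sketch of the kind of grid overlay the abstract describes: drawing grid lines over a GUI screenshot and labeling each intersection with its pixel coordinates, so the VLM can reason against visible anchors. The helper name `overlay_spatial_grid`, the grid spacing, and the color choices are illustrative assumptions, not the paper's exact implementation.

```python
from PIL import Image, ImageDraw

def overlay_spatial_grid(img, step=100):
    """Overlay grid lines and coordinate labels on a screenshot.

    Hypothetical helper illustrating spatial visualization prompting;
    the paper's actual overlay style and spacing may differ.
    """
    out = img.copy()
    draw = ImageDraw.Draw(out)
    w, h = out.size
    # Draw vertical and horizontal grid lines every `step` pixels.
    for x in range(0, w, step):
        draw.line([(x, 0), (x, h)], fill=(255, 0, 0), width=1)
    for y in range(0, h, step):
        draw.line([(0, y), (w, y)], fill=(255, 0, 0), width=1)
    # Label each intersection with its (x, y) pixel coordinates,
    # giving the model explicit textual anchors to reason over.
    for x in range(0, w, step):
        for y in range(0, h, step):
            draw.text((x + 2, y + 2), f"({x},{y})", fill=(0, 0, 255))
    return out

# Example: annotate a blank stand-in screenshot.
screenshot = Image.new("RGB", (400, 300), "white")
annotated = overlay_spatial_grid(screenshot, step=100)
```

The annotated image, rather than the raw screenshot, is then passed to the VLM together with the grounding instruction, which is what makes the approach zero-shot and plug-and-play across models.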