🤖 AI Summary
General-purpose vision-language models (VLMs) struggle to directly output precise coordinates for GUI localization tasks, and existing fine-tuning approaches require large-scale annotated data, making them costly and impractical.
Method: We propose a zero-shot auxiliary reasoning framework that activates VLMs’ intrinsic spatial understanding without fine-tuning. It introduces structured spatial priors via *spatial visualization prompting*: overlaying coordinate axes, grid lines, and labeled intersection points onto input GUI images to enable explicit spatial reasoning. The framework comprises three plug-and-play zero-shot strategies.
Contribution/Results: Extensive evaluation across four GUI localization benchmarks with seven state-of-the-art VLMs demonstrates substantial gains in average localization accuracy, strong cross-model generalization, and effective task transferability. To our knowledge, this is the first work to incorporate structured spatial priors as prompts for zero-shot GUI localization, establishing a new paradigm for unlocking VLMs’ spatial capabilities in low-resource settings.
📝 Abstract
Graphical user interface (GUI) grounding is a fundamental task for building GUI agents. However, general vision-language models (VLMs) struggle with this task due to a lack of task-specific optimization. In this paper, we identify a key gap: while VLMs exhibit significant latent grounding potential, as demonstrated by their performance under the Pointing Game evaluation protocol, they underperform when tasked with outputting explicit coordinates. To address this discrepancy, and to bypass the high data and annotation costs of current fine-tuning approaches, we propose three zero-shot auxiliary reasoning methods. By providing explicit spatial cues such as axes, grids, and labeled intersections as part of the input image, these methods enable VLMs to articulate their implicit spatial understanding capabilities. We evaluate these methods on four GUI grounding benchmarks across seven open-source and proprietary VLMs. The evaluation results demonstrate that the proposed methods substantially improve the performance of GUI grounding.
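To make the core idea concrete, here is a minimal sketch of the kind of grid overlay the abstract describes: drawing grid lines over a GUI screenshot and labeling each intersection with its pixel coordinates, so the VLM can reason against visible anchors. The helper name `overlay_spatial_grid`, the grid spacing, and the color choices are illustrative assumptions, not the paper's exact implementation.

```python
from PIL import Image, ImageDraw

def overlay_spatial_grid(img, step=100):
    """Overlay grid lines and coordinate labels on a screenshot.

    Hypothetical helper illustrating spatial visualization prompting;
    the paper's actual overlay style and spacing may differ.
    """
    out = img.copy()
    draw = ImageDraw.Draw(out)
    w, h = out.size
    # Draw vertical and horizontal grid lines every `step` pixels.
    for x in range(0, w, step):
        draw.line([(x, 0), (x, h)], fill=(255, 0, 0), width=1)
    for y in range(0, h, step):
        draw.line([(0, y), (w, y)], fill=(255, 0, 0), width=1)
    # Label each intersection with its (x, y) pixel coordinates,
    # giving the model explicit textual anchors to reason over.
    for x in range(0, w, step):
        for y in range(0, h, step):
            draw.text((x + 2, y + 2), f"({x},{y})", fill=(0, 0, 255))
    return out

# Example: annotate a blank stand-in screenshot.
screenshot = Image.new("RGB", (400, 300), "white")
annotated = overlay_spatial_grid(screenshot, step=100)
```

The annotated image, rather than the raw screenshot, is then passed to the VLM together with the grounding instruction, which is what makes the approach zero-shot and plug-and-play across models.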