How Auxiliary Reasoning Unleashes GUI Grounding in VLMs

📅 2025-09-14
🤖 AI Summary
General-purpose vision-language models (VLMs) struggle to directly output precise coordinates for GUI localization tasks, and existing fine-tuning approaches require large-scale annotated data, making them costly and impractical. Method: We propose a zero-shot auxiliary reasoning framework that activates VLMs' intrinsic spatial understanding without fine-tuning. It introduces structured spatial priors via *spatial visualization prompting*: overlaying coordinate axes, grid lines, and labeled intersection points onto input GUI images to enable explicit spatial reasoning. The framework comprises three plug-and-play zero-shot strategies. Contribution/Results: Extensive evaluation across four GUI localization benchmarks with seven state-of-the-art VLMs demonstrates substantial gains in average localization accuracy, strong cross-model generalization, and effective task transferability. To our knowledge, this is the first work to incorporate structured spatial priors as prompts for zero-shot GUI localization, establishing a new paradigm for unlocking VLMs' spatial capabilities in low-resource settings.
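As a rough illustration of the spatial visualization prompting idea described above, overlaying a labeled grid onto a GUI screenshot can be sketched as follows. This is a minimal sketch using Pillow; the function name, grid step, and colors are illustrative assumptions, not the paper's actual implementation:

```python
from PIL import Image, ImageDraw

def overlay_grid(img: Image.Image, step: int = 100) -> Image.Image:
    """Overlay grid lines and labeled intersection coordinates on a GUI image.

    Returns a new image; the input is left unmodified.
    """
    out = img.convert("RGB").copy()
    draw = ImageDraw.Draw(out)
    w, h = out.size
    # Draw vertical and horizontal grid lines at fixed pixel intervals.
    for x in range(0, w, step):
        draw.line([(x, 0), (x, h)], fill=(255, 0, 0), width=1)
    for y in range(0, h, step):
        draw.line([(0, y), (w, y)], fill=(255, 0, 0), width=1)
    # Label each grid intersection with its (x, y) pixel coordinate,
    # giving the VLM an explicit coordinate reference to reason over.
    for x in range(0, w, step):
        for y in range(0, h, step):
            draw.text((x + 2, y + 2), f"({x},{y})", fill=(0, 0, 255))
    return out
```

The augmented image is then passed to the VLM together with the localization prompt, so the model can anchor its answer to the visible coordinate labels instead of estimating raw pixel positions.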

📝 Abstract
Graphical user interface (GUI) grounding is a fundamental task for building GUI agents. However, general vision-language models (VLMs) struggle with this task due to a lack of specific optimization. We identify a key gap in this paper: while VLMs exhibit significant latent grounding potential, as demonstrated by their Pointing Game performance, they underperform when tasked with outputting explicit coordinates. To address this discrepancy, and to bypass the high data and annotation costs of current fine-tuning approaches, we propose three zero-shot auxiliary reasoning methods. By providing explicit spatial cues such as axes, grids, and labeled intersections as part of the input image, these methods enable VLMs to articulate their implicit spatial understanding capabilities. We evaluate these methods on four GUI grounding benchmarks across seven open-source and proprietary VLMs. The evaluation results demonstrate that the proposed methods substantially improve the performance of GUI grounding.
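The Pointing Game protocol mentioned in the abstract scores a prediction as correct when the predicted point lands inside the target element's ground-truth bounding box, rather than requiring exact coordinates. A minimal sketch of that scoring rule (function names and the bbox convention are illustrative assumptions, not from the paper):

```python
def pointing_game_hit(point, bbox):
    """Return True if the predicted point falls inside the ground-truth box.

    point: (x, y); bbox: (left, top, right, bottom) in pixels.
    """
    x, y = point
    x1, y1, x2, y2 = bbox
    return x1 <= x <= x2 and y1 <= y <= y2

def pointing_game_accuracy(points, bboxes):
    """Fraction of predictions whose point lands inside its target box."""
    hits = sum(pointing_game_hit(p, b) for p, b in zip(points, bboxes))
    return hits / len(points)
```

Under this looser criterion a VLM can score well even when its raw coordinate outputs are too imprecise for direct GUI grounding, which is the gap the paper highlights.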
Problem

Research questions and friction points this paper is trying to address.

Addresses VLMs' struggle with GUI grounding tasks
Proposes zero-shot methods to reduce annotation costs
Enhances spatial understanding without fine-tuning VLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-shot auxiliary reasoning methods
Spatial cues like axes and grids
Enhancing implicit spatial understanding capabilities
Weiming Li
Principal Engineer, Samsung Electronics
Computer Vision, Augmented Reality, Computational Imaging and Display
Yan Shao
China Mobile, Hangzhou Research and Development Center, China
Jing Yang
Zhejiang Lab, Hangzhou, China
Yujing Lu
Zhejiang Lab, Hangzhou, China
Ling Zhong
Zhejiang Lab, Hangzhou, China
Yuhan Wang
Zhejiang Lab, Hangzhou, China
Manni Duan
Zhejiang Lab, Hangzhou, China