AutoFocus: Uncertainty-Aware Active Visual Search for GUI Grounding

📅 2026-05-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the difficulty vision-language models (VLMs) have in precisely localizing small interactive elements in high-resolution graphical user interfaces (GUIs): input resolution constraints degrade their ability to map instructions to accurate coordinates. To overcome this limitation, the authors propose a training-free active visual search framework that uses token-level perplexity to quantify anisotropic spatial uncertainty. The framework samples multiple coordinate hypotheses and constructs a Gaussian probability field from them to guide both global and local region proposals. By combining shape-aware zooming with visual-prompt-driven consistency aggregation, the method achieves orientation-sensitive, adaptive fine-grained localization. Experiments on the ScreenSpot-Pro and ScreenSpot-V2 benchmarks show consistent improvements in localization accuracy across both general-purpose and GUI-specific VLMs.
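To make "token-level perplexity" concrete, the sketch below computes a separate perplexity for the tokens that spell the x coordinate and those that spell the y coordinate. It uses the standard definition of perplexity (the exponential of the negative mean log-probability). The log-probability values and the way tokens are split per axis are hypothetical illustrations, not details taken from the paper.

```python
import math


def perplexity(token_logprobs):
    """Standard perplexity: exp(-mean natural-log probability) over a token span."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical natural-log probabilities for the tokens of each coordinate,
# e.g. the digits of x and the digits of y in a generated "(x, y)" string.
x_logprobs = [-0.05, -0.10, -0.02]  # the model is fairly confident about x
y_logprobs = [-0.90, -1.40, -0.70]  # the model is less confident about y

ppl_x = perplexity(x_logprobs)
ppl_y = perplexity(y_logprobs)

# A larger perplexity on one axis suggests more uncertainty along that axis.
print(f"ppl_x={ppl_x:.3f}, ppl_y={ppl_y:.3f}")
```

In this toy case the y axis has the higher perplexity, which is the kind of directional (anisotropic) signal the summary describes.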

📝 Abstract

Vision-Language Models (VLMs) have enabled autonomous GUI agents that translate natural language instructions into executable screen coordinates. However, grounding performance degrades in high-resolution interfaces, where dense layouts and small interactive elements expose a resolution gap between modern displays and model input constraints. Existing zoom-in strategies rely on fixed anchors, heuristic grids, or reinforcement learning, and lack a principled mechanism for adaptively determining where refinement is needed and how much spatial uncertainty should be explored. We propose AutoFocus, a training-free, uncertainty-aware active visual search framework for GUI grounding. Our key insight is that token-level perplexity in coordinate generation naturally reflects spatial uncertainty. Rather than committing to a single prediction, AutoFocus samples multiple coordinate hypotheses and converts their axial perplexities into an anisotropic Gaussian spatial probability field, explicitly modeling directional uncertainty. Based on this field, we generate global and local region proposals and introduce Shape-Aware Zooming to balance tight localization with contextual preservation. A visual prompt-based aggregation step then selects the most consistent prediction via structured comparison. Extensive experiments on ScreenSpot-Pro and ScreenSpot-V2 demonstrate consistent improvements across both general-purpose and GUI-specialized VLMs.
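As a rough illustration of how axial perplexities could shape an anisotropic Gaussian field and a zoom region, consider the sketch below. Everything specific in it is an assumption made for illustration: the linear perplexity-to-sigma mapping (`sigma_from_perplexity`, `base_sigma`), the sum-of-Gaussians field, the `k`-sigma crop rule, and the sample hypotheses. The abstract does not state the paper's actual formulas, so this should not be read as the authors' implementation.

```python
import math


def sigma_from_perplexity(ppl, base_sigma=20.0):
    """Assumed mapping: the spread in pixels grows linearly with perplexity."""
    return base_sigma * ppl


def field_value(px, py, hypotheses):
    """Unnormalized sum of axis-aligned Gaussians, one per hypothesis.

    Each hypothesis is a tuple (x, y, ppl_x, ppl_y).
    """
    total = 0.0
    for x, y, ppl_x, ppl_y in hypotheses:
        sx = sigma_from_perplexity(ppl_x)
        sy = sigma_from_perplexity(ppl_y)
        total += math.exp(-0.5 * (((px - x) / sx) ** 2 + ((py - y) / sy) ** 2))
    return total


def propose_region(hypothesis, k=2.0, width=3840, height=2160):
    """Crop box covering k sigma per axis around one hypothesis, clipped to the screen."""
    x, y, ppl_x, ppl_y = hypothesis
    sx = sigma_from_perplexity(ppl_x)
    sy = sigma_from_perplexity(ppl_y)
    left, right = max(0.0, x - k * sx), min(float(width), x + k * sx)
    top, bottom = max(0.0, y - k * sy), min(float(height), y + k * sy)
    return left, top, right, bottom


# Hypothetical sampled predictions: confident in x, uncertain in y.
hypotheses = [(1200, 640, 1.1, 2.5), (1215, 700, 1.2, 2.0)]

left, top, right, bottom = propose_region(hypotheses[0])
print(f"crop: w={right - left:.0f}px, h={bottom - top:.0f}px")
print(f"field near hypotheses: {field_value(1205, 660, hypotheses):.3f}")
```

Because the y perplexity is higher in this toy input, the proposed crop is taller than it is wide, so the zoom explores more along the uncertain axis while staying tight along the confident one.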

Problem

Research questions and friction points this paper is trying to address.

GUI grounding

spatial uncertainty

high-resolution interfaces

visual search

resolution gap

Innovation

Methods, ideas, or system contributions that make the work stand out.

uncertainty-aware

active visual search

GUI grounding