UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding

📅 2026-04-15
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses natural-language grounding in graphical user interfaces (GUIs), where small icons and dense layouts hinder accurate localization. The authors propose a training-free, adaptive zoom-in framework that treats both the decision to zoom in and the choice of crop radius as an uncertainty quantification problem. A confidence-aware gate fuses spatial consensus among stochastically sampled candidates with token-level generation confidence, so zoom-in is triggered only for instances where localization is uncertain. The crop radius is then derived per instance by decomposing prediction variance into inter-sample positional spread and intra-sample box extent via the law of total variance. The method plugs into existing models without additional training. On ScreenSpot-Pro, UI-Vision, and ScreenSpot-v2, it achieves gains of up to 13.4%, 10.3%, and 4.2% respectively over strong baselines.

📝 Abstract
GUI grounding, which localizes interface elements from screenshots given natural language queries, remains challenging for small icons and dense layouts. Test-time zoom-in methods improve localization by cropping and re-running inference at higher resolution, but apply cropping uniformly across all instances with fixed crop sizes, ignoring whether the model is actually uncertain on each case. We propose UI-Zoomer, a training-free adaptive zoom-in framework that treats both the trigger and scale of zoom-in as a prediction uncertainty quantification problem. A confidence-aware gate fuses spatial consensus among stochastic candidates with token-level generation confidence to selectively trigger zoom-in only when localization is uncertain. When triggered, an uncertainty-driven crop sizing module decomposes prediction variance into inter-sample positional spread and intra-sample box extent, deriving a per-instance crop radius via the law of total variance. Extensive experiments on ScreenSpot-Pro, UI-Vision, and ScreenSpot-v2 demonstrate consistent improvements over strong baselines across multiple model architectures, achieving gains of up to +13.4%, +10.3%, and +4.2% respectively, with no additional training required.
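
The gate and crop-sizing steps described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the exact formulas, so everything below is an assumption made for illustration. In particular, spatial consensus is approximated by mean pairwise IoU, token confidence by the geometric-mean token probability, and the fusion weight `alpha`, threshold `tau`, and radius multiplier `k` are hypothetical parameters. Each sampled box is modeled as a uniform distribution over its extent, so the law of total variance gives Var(X) = E[Var(X | sample)] + Var(E[X | sample]), i.e. mean intra-box variance plus the variance of the box centers.

```python
import math
from itertools import combinations
from statistics import fmean, pvariance

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def spatial_consensus(boxes: list[Box]) -> float:
    """Assumed consensus measure: mean pairwise IoU among sampled boxes."""
    pairs = list(combinations(boxes, 2))
    return fmean([iou(a, b) for a, b in pairs]) if pairs else 1.0


def token_confidence(logprobs: list[float]) -> float:
    """Assumed token-level confidence: geometric mean of token probabilities."""
    return math.exp(fmean(logprobs))


def should_zoom(boxes: list[Box], token_logprobs: list[list[float]],
                alpha: float = 0.5, tau: float = 0.7) -> bool:
    """Trigger zoom-in when the fused confidence falls below tau.

    alpha and tau are hypothetical values, not taken from the paper.
    """
    gen_conf = fmean([token_confidence(lp) for lp in token_logprobs])
    fused = alpha * spatial_consensus(boxes) + (1.0 - alpha) * gen_conf
    return fused < tau


def crop_radius(boxes: list[Box], k: float = 2.0) -> tuple[float, float]:
    """Per-axis crop radius from the law of total variance.

    inter-sample term: variance of box centers across samples.
    intra-sample term: mean variance of a uniform distribution over
    each box extent (side**2 / 12). k is a hypothetical multiplier.
    """
    cx = [(b[0] + b[2]) / 2 for b in boxes]
    cy = [(b[1] + b[3]) / 2 for b in boxes]
    intra_x = fmean([(b[2] - b[0]) ** 2 / 12 for b in boxes])
    intra_y = fmean([(b[3] - b[1]) ** 2 / 12 for b in boxes])
    total_x = pvariance(cx) + intra_x
    total_y = pvariance(cy) + intra_y
    return k * math.sqrt(total_x), k * math.sqrt(total_y)
```

Under these assumptions, candidates that agree closely and were generated with high token probability skip zoom-in entirely, while scattered candidates both trigger zoom-in and yield a larger crop radius, because the inter-sample term dominates the total variance.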
Problem

Research questions and friction points this paper is trying to address.

GUI grounding
uncertainty quantification
adaptive zoom-in
small icons
dense layouts
Innovation

Methods, ideas, or system contributions that make the work stand out.

uncertainty quantification
adaptive zoom-in
GUI grounding
confidence-aware gating
training-free refinement