UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding

📅 2026-04-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses natural-language grounding in graphical user interfaces (GUIs), where small icons and dense layouts hinder accurate localization. The authors propose UI-Zoomer, a training-free adaptive zoom-in framework that treats both when to zoom in and how large a region to crop as a prediction uncertainty quantification problem. A confidence-aware gate fuses spatial consensus among stochastic candidates with token-level generation confidence and triggers zoom-in only when localization is uncertain. When zoom-in is triggered, the crop radius is set per instance by decomposing prediction variance into inter-sample positional spread and intra-sample box extent via the law of total variance. On ScreenSpot-Pro, UI-Vision, and ScreenSpot-v2, the approach achieves gains of up to 13.4%, 10.3%, and 4.2%, respectively, improving consistently over strong baselines across multiple model architectures.

📝 Abstract

GUI grounding, which localizes interface elements from screenshots given natural language queries, remains challenging for small icons and dense layouts. Test-time zoom-in methods improve localization by cropping and re-running inference at higher resolution, but apply cropping uniformly across all instances with fixed crop sizes, ignoring whether the model is actually uncertain on each case. We propose UI-Zoomer, a training-free adaptive zoom-in framework that treats both the trigger and scale of zoom-in as a prediction uncertainty quantification problem. A confidence-aware gate fuses spatial consensus among stochastic candidates with token-level generation confidence to selectively trigger zoom-in only when localization is uncertain. When triggered, an uncertainty-driven crop sizing module decomposes prediction variance into inter-sample positional spread and intra-sample box extent, deriving a per-instance crop radius via the law of total variance. Extensive experiments on ScreenSpot-Pro, UI-Vision, and ScreenSpot-v2 demonstrate consistent improvements over strong baselines across multiple model architectures, achieving gains of up to +13.4%, +10.3%, and +4.2% respectively, with no additional training required.
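
The gate and the crop sizing described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the formulas, so everything concrete here is an assumption made for illustration. In particular, mean pairwise IoU stands in for "spatial consensus", the fusion is a simple weighted average with hypothetical parameters `alpha` and `threshold`, each sampled box is modelled as a uniform distribution over its extent (so its per-axis variance is width squared over 12), and the multiplier `k` that turns a standard deviation into a radius is invented. The only part taken as given is the law of total variance: total variance equals the variance of the sample centers (inter-sample spread) plus the mean within-sample variance (intra-sample box extent).

```python
import math
from itertools import combinations
from statistics import fmean, pvariance
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def spatial_consensus(boxes: List[Box]) -> float:
    """Assumed consensus measure: mean pairwise IoU of the sampled boxes."""
    pairs = list(combinations(boxes, 2))
    if not pairs:  # a single sample trivially agrees with itself
        return 1.0
    return fmean([iou(a, b) for a, b in pairs])


def should_zoom(boxes: List[Box], token_confs: List[float],
                threshold: float = 0.7, alpha: float = 0.5) -> bool:
    """Hypothetical gate: zoom in only when the fused confidence is low.

    token_confs holds one confidence per sample, e.g. the mean token
    probability of the generated coordinates (an assumption).
    """
    score = alpha * spatial_consensus(boxes) + (1 - alpha) * fmean(token_confs)
    return score < threshold


def crop_radius(boxes: List[Box], k: float = 3.0) -> float:
    """Per-instance crop radius from the law of total variance.

    Per axis: Var_total = Var(centers) + E[extent^2 / 12], where the second
    term assumes a uniform distribution inside each sampled box.
    """
    cx = [(b[0] + b[2]) / 2 for b in boxes]
    cy = [(b[1] + b[3]) / 2 for b in boxes]
    within_x = fmean([(b[2] - b[0]) ** 2 / 12 for b in boxes])
    within_y = fmean([(b[3] - b[1]) ** 2 / 12 for b in boxes])
    var_x = pvariance(cx) + within_x
    var_y = pvariance(cy) + within_y
    return k * math.sqrt(max(var_x, var_y))
```

Under these assumptions, samples that agree closely and are generated with high token confidence skip zoom-in entirely, while scattered samples both trigger zoom-in and widen the crop, because the inter-sample term grows with the spread of the centers.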

Problem

Research questions and friction points this paper is trying to address.

GUI grounding

uncertainty quantification

adaptive zoom-in

small icons

dense layouts

Innovation

Methods, ideas, or system contributions that make the work stand out.

uncertainty quantification

adaptive zoom-in

GUI grounding