🤖 AI Summary
Existing GUI grounding models are highly sensitive to small visual perturbations such as minor cropping: the predicted pixel coordinates can shift substantially, particularly on high-resolution screens with small UI elements, which undermines robustness. To address this, we propose a training-free multi-view inference framework. It uses attention scores to generate diverse local views, then applies spatial density clustering to the coordinates predicted from each view to detect and reject outliers; the centroid of the densest cluster is the final localization output. The method requires no model fine-tuning and is architecture-agnostic. On the ScreenSpot-Pro benchmark, it raises the accuracy of Qwen3VL-32B-Instruct to 74.0%, and the improvement in high-precision localization carries over to other models.
📝 Abstract
GUI grounding, which translates natural language instructions into precise pixel coordinates, is essential for developing practical GUI agents. However, we observe that existing grounding models exhibit significant coordinate prediction instability: minor visual perturbations (e.g., cropping a few pixels) can drastically alter predictions, flipping results between correct and incorrect. This instability severely undermines model performance, especially for high-resolution samples with small UI elements. To address this issue, we propose Multi-View Prediction (MVP), a training-free framework that enhances grounding performance through multi-view inference. Our key insight is that while single-view predictions may be unstable, aggregating predictions from multiple carefully cropped views can effectively distinguish correct coordinates from outliers. MVP comprises two components: (1) Attention-Guided View Proposal, which derives diverse views guided by instruction-to-image attention scores, and (2) Multi-Coordinates Clustering, which ensembles predictions by selecting the centroid of the densest spatial cluster. Extensive experiments demonstrate MVP's effectiveness across various models and benchmarks. Notably, on ScreenSpot-Pro, MVP boosts UI-TARS-1.5-7B to 56.1%, GTA1-7B to 61.7%, Qwen3VL-8B-Instruct to 65.3%, and Qwen3VL-32B-Instruct to 74.0%. The code is available at https://github.com/ZJUSCL/MVP.
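The aggregation step can be illustrated with a minimal sketch. It is not the released MVP implementation: the function names `to_full_image` and `densest_cluster_centroid`, the distance threshold `eps`, and the simple neighbor-linking rule used as the clustering method are all assumptions made for illustration. The sketch assumes each view's prediction has already been obtained from a grounding model, maps it back to full-image coordinates, groups points that lie within `eps` pixels of one another, and returns the centroid of the largest group.

```python
from collections import deque
from math import hypot


def to_full_image(xy, view_origin):
    """Shift a prediction made inside a cropped view back to full-image pixels.

    view_origin is the (left, top) corner of the crop (hypothetical convention).
    """
    return (xy[0] + view_origin[0], xy[1] + view_origin[1])


def densest_cluster_centroid(points, eps=20.0):
    """Return the centroid of the largest group of mutually reachable points.

    Two points are linked when they are at most eps pixels apart; groups are
    the connected components of that link graph. This is a simplified
    stand-in for a density clustering method, not the paper's exact procedure.
    """
    if not points:
        raise ValueError("need at least one predicted coordinate")

    unvisited = set(range(len(points)))
    best = []
    while unvisited:
        seed = unvisited.pop()
        cluster, queue = [seed], deque([seed])
        while queue:
            i = queue.popleft()
            # Collect still-unassigned points within eps of point i.
            near = [
                j for j in unvisited
                if hypot(points[i][0] - points[j][0],
                         points[i][1] - points[j][1]) <= eps
            ]
            for j in near:
                unvisited.remove(j)
                cluster.append(j)
                queue.append(j)
        if len(cluster) > len(best):
            best = cluster

    cx = sum(points[i][0] for i in best) / len(best)
    cy = sum(points[i][1] for i in best) / len(best)
    return (cx, cy)


# Illustrative values only: (prediction inside the view, view's top-left corner).
view_predictions = [
    ((50, 60), (50, 40)),    # -> (100, 100)
    ((24, 18), (80, 80)),    # -> (104, 98)
    ((96, 102), (0, 0)),     # -> (96, 102)
    ((100, 20), (300, 30)),  # -> (400, 50), an outlier
    ((10, 100), (0, 200)),   # -> (10, 300), an outlier
]
points = [to_full_image(xy, origin) for xy, origin in view_predictions]
final_xy = densest_cluster_centroid(points, eps=20.0)
```

In this toy input, three of the five views agree to within a few pixels, so the two stray predictions are discarded and the centroid of the agreeing group becomes the output. The choice of `eps` would in practice depend on screen resolution and typical UI element size.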