🤖 AI Summary
This paper addresses key challenges in GUI grounding (poor cross-platform generalization, difficulty parsing complex layouts, and inaccurate fine-grained element localization) by proposing ZoomClick, a training-free localization method that leverages zoom as a strong geometric prior. ZoomClick is the first to systematically model four intrinsic zoom properties: pre-zoom, depth, shrink size, and minimal crop size. It introduces dynamic spatial focusing and adaptive context switching, integrated with multi-scale cropping and hierarchical click strategies, to enable zero-shot fine-grained element localization. Contributions include: (1) the first training-free augmentation framework explicitly designed for GUI zoom characteristics; (2) GUIZoom-Bench, an open-source benchmark for evaluating model adaptability to zoom; and (3) state-of-the-art performance on established benchmarks; for example, UI-Venus-72B reaches a 73.1% success rate on ScreenSpot-Pro, surpassing prior methods and setting a new SOTA.
📝 Abstract
Grounding is a fundamental capability for building graphical user interface (GUI) agents. Although existing approaches rely on large-scale bounding box supervision, they still face various challenges, such as cross-platform generalization, complex layout analysis, and fine-grained element localization. In this paper, we investigate zoom as a strong yet underexplored prior for GUI grounding, and propose a training-free method, ZoomClick. By characterizing four key properties of zoom (i.e., pre-zoom, depth, shrink size, minimal crop size), we unlock its full capabilities for dynamic spatial focusing and adaptive context switching. Experiments demonstrate that our method significantly boosts the performance of both general vision-language and specialized GUI grounding models, achieving state-of-the-art results on several mainstream benchmarks; for example, UI-Venus-72B attains a 73.1% success rate on ScreenSpot-Pro. Furthermore, we present GUIZoom-Bench, a benchmark for evaluating model adaptability to zoom, aiming to inspire future research on leveraging zoom for further training and test-time scaling in GUI grounding tasks.
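To make the abstract's core idea concrete, here is a minimal sketch of an iterative zoom-and-crop localization loop parameterized by the four zoom properties named above (depth, shrink size, minimal crop size; pre-zoom is omitted for brevity). All names (`ZoomConfig`, `zoom_click`, the `grounder` callback) are illustrative assumptions, not the paper's actual API, and the loop is a simplified stand-in for ZoomClick's multi-scale cropping and hierarchical click strategies.

```python
from dataclasses import dataclass

@dataclass
class ZoomConfig:
    depth: int = 3        # maximum number of zoom iterations (hypothetical default)
    shrink: float = 0.5   # each crop keeps this fraction of the current view
    min_crop: int = 200   # stop shrinking once the view would fall below this size (px)

def zoom_click(image_size, grounder, cfg=ZoomConfig()):
    """Repeatedly crop around the model's predicted click point and re-query.

    `grounder(ox, oy, vw, vh)` is assumed to return a click point (x, y) in the
    coordinates of the current view, whose top-left corner is (ox, oy) in the
    full image and whose size is (vw, vh).
    """
    W, H = image_size
    ox, oy = 0.0, 0.0               # offset of the current view in full-image coords
    vw, vh = float(W), float(H)
    for _ in range(cfg.depth):
        x, y = grounder(ox, oy, vw, vh)
        gx, gy = ox + x, oy + y     # map the prediction back to full-image coords
        nw, nh = vw * cfg.shrink, vh * cfg.shrink
        if nw < cfg.min_crop or nh < cfg.min_crop:
            return gx, gy           # next view would be too small: stop zooming
        # center the next crop on the prediction, clamped to the image bounds
        ox = min(max(gx - nw / 2, 0.0), W - nw)
        oy = min(max(gy - nh / 2, 0.0), H - nh)
        vw, vh = nw, nh
    x, y = grounder(ox, oy, vw, vh)
    return ox + x, oy + y
```

With a well-behaved grounder, each iteration halves the view around the current prediction, so the model sees the target element at progressively higher effective resolution; the `min_crop` floor prevents zooming past the point where surrounding context is lost.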