🤖 AI Summary
This paper addresses key challenges in GUI grounding (poor cross-platform generalization, difficulty parsing complex layouts, and inaccurate fine-grained element localization) by proposing ZoomClick, a training-free localization method that leverages zoom as a strong geometric prior. ZoomClick is the first to systematically model four intrinsic zoom properties: pre-zoom, depth, shrink size, and minimal crop size. It introduces dynamic spatial focusing and adaptive context switching, integrated with multi-scale cropping and hierarchical click strategies, to enable zero-shot fine-grained element localization. Contributions include: (1) the first training-free augmentation framework explicitly designed for GUI zoom characteristics; (2) GUIZoom-Bench, an open-source benchmark for evaluating model adaptability to zoom; and (3) state-of-the-art performance on established benchmarks; for example, UI-Venus-72B reaches a 73.1% success rate on ScreenSpot-Pro, surpassing prior methods and setting a new SOTA.
📝 Abstract
Grounding is a fundamental capability for building graphical user interface (GUI) agents. Although existing approaches rely on large-scale bounding box supervision, they still face various challenges, such as cross-platform generalization, complex layout analysis, and fine-grained element localization. In this paper, we investigate zoom as a strong yet underexplored prior for GUI grounding, and propose a training-free method, ZoomClick. By characterizing four key properties of zoom (i.e., pre-zoom, depth, shrink size, minimal crop size), we unlock its full capabilities for dynamic spatial focusing and adaptive context switching. Experiments demonstrate that our method significantly boosts the performance of both general vision-language and specialized GUI grounding models, achieving state-of-the-art results on several mainstream benchmarks; for example, UI-Venus-72B attains a 73.1% success rate on ScreenSpot-Pro. Furthermore, we present GUIZoom-Bench, a benchmark for evaluating model adaptability to zoom, aiming to inspire future research on leveraging zoom for further training and test-time scaling in GUI grounding tasks.
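To make the abstract's core idea concrete, here is a minimal sketch of an iterative zoom-and-crop localization loop parameterized by the four zoom properties named above (depth, shrink size, minimal crop size; pre-zoom is omitted for brevity). All names (`ZoomConfig`, `zoom_click`, the `grounder` callback) are illustrative assumptions, not the paper's actual API, and the loop is a simplified stand-in for ZoomClick's multi-scale cropping and hierarchical click strategies.

```python
from dataclasses import dataclass

@dataclass
class ZoomConfig:
    depth: int = 3        # maximum number of zoom iterations (hypothetical default)
    shrink: float = 0.5   # each crop keeps this fraction of the current view
    min_crop: int = 200   # stop shrinking once the view would fall below this size (px)

def zoom_click(image_size, grounder, cfg=ZoomConfig()):
    """Repeatedly crop around the model's predicted click point and re-query.

    `grounder(ox, oy, vw, vh)` is assumed to return a click point (x, y) in the
    coordinates of the current view, whose top-left corner is (ox, oy) in the
    full image and whose size is (vw, vh).
    """
    W, H = image_size
    ox, oy = 0.0, 0.0               # offset of the current view in full-image coords
    vw, vh = float(W), float(H)
    for _ in range(cfg.depth):
        x, y = grounder(ox, oy, vw, vh)
        gx, gy = ox + x, oy + y     # map the prediction back to full-image coords
        nw, nh = vw * cfg.shrink, vh * cfg.shrink
        if nw < cfg.min_crop or nh < cfg.min_crop:
            return gx, gy           # next view would be too small: stop zooming
        # center the next crop on the prediction, clamped to the image bounds
        ox = min(max(gx - nw / 2, 0.0), W - nw)
        oy = min(max(gy - nh / 2, 0.0), H - nh)
        vw, vh = nw, nh
    x, y = grounder(ox, oy, vw, vh)
    return ox + x, oy + y
```

With a well-behaved grounder, each iteration halves the view around the current prediction, so the model sees the target element at progressively higher effective resolution; the `min_crop` floor prevents zooming past the point where surrounding context is lost.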