Zoom in, Click out: Unlocking and Evaluating the Potential of Zooming for GUI Grounding

📅 2025-12-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper tackles key challenges in GUI grounding (poor cross-platform generalization, difficulty parsing complex layouts, and inaccurate fine-grained element localization) by proposing ZoomClick, a training-free localization method that leverages zoom as a strong geometric prior. ZoomClick is the first to systematically model four intrinsic zoom properties: pre-zoom, depth, shrink size, and minimal crop size. It combines dynamic spatial focusing and adaptive context switching with multi-scale cropping and hierarchical click strategies, enabling zero-shot fine-grained element localization. Contributions include: (1) the first training-free zooming framework explicitly designed for GUI characteristics; (2) GUIZoom-Bench, an open-source benchmark for evaluating model adaptability to zoom; and (3) state-of-the-art results on mainstream benchmarks, for example, UI-Venus-72B reaches a 73.1% success rate on ScreenSpot-Pro, surpassing prior methods and setting a new SOTA.
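The zoom properties described above suggest an iterative refine-by-cropping loop: query the model on the full screenshot, then repeatedly re-crop around its prediction until the crop shrinks below a minimum size or a depth limit is hit. The sketch below is a hypothetical illustration of that idea, not the paper's actual algorithm; `predict`, `shrink`, `min_crop`, and `max_depth` are assumed names standing in for the paper's shrink size, minimal crop size, and depth properties.

```python
def zoom_click(predict, width, height, shrink=0.5, min_crop=200, max_depth=4):
    """Iteratively zoom toward a predicted click point and re-query the model.

    predict(left, top, right, bottom) -> (x, y) in full-image coordinates.
    Hypothetical sketch: each step shrinks the crop by `shrink`, centers it on
    the latest prediction, and stops once the crop would fall below `min_crop`
    pixels on either side or `max_depth` zoom steps have been taken.
    """
    left, top, right, bottom = 0.0, 0.0, float(width), float(height)
    x, y = predict(left, top, right, bottom)  # coarse guess on the full view
    for _ in range(max_depth):
        w, h = (right - left) * shrink, (bottom - top) * shrink
        if w < min_crop or h < min_crop:
            break  # crop too small: further zooming would discard context
        # Center the next crop on the current prediction, clamped to the image.
        left = min(max(x - w / 2, 0.0), width - w)
        top = min(max(y - h / 2, 0.0), height - h)
        right, bottom = left + w, top + h
        x, y = predict(left, top, right, bottom)  # refine on the zoomed view
    return x, y
```

In this sketch, a prediction that gets more accurate as the view narrows converges toward the target element, which is the intuition behind using zoom as a geometric prior.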

📝 Abstract
Grounding is a fundamental capability for building graphical user interface (GUI) agents. Although existing approaches rely on large-scale bounding box supervision, they still face various challenges, such as cross-platform generalization, complex layout analysis, and fine-grained element localization. In this paper, we investigate zoom as a strong yet underexplored prior for GUI grounding, and propose a training-free method, ZoomClick. By characterizing four key properties of zoom (i.e., pre-zoom, depth, shrink size, minimal crop size), we unlock its full capabilities for dynamic spatial focusing and adaptive context switching. Experiments demonstrate that our method significantly boosts the performance of both general vision-language and specialized GUI grounding models, achieving state-of-the-art results on several mainstream benchmarks; for example, UI-Venus-72B attains a 73.1% success rate on ScreenSpot-Pro. Furthermore, we present GUIZoom-Bench, a benchmark for evaluating model adaptability to zoom, aiming to inspire future research on improving zoom for further training and test-time scaling in GUI grounding tasks.
Problem

Research questions and friction points this paper is trying to address.

Unlocking zoom's potential for GUI grounding without training
Enhancing model performance via dynamic spatial focusing and context switching
Evaluating model adaptability to zoom with a new benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free zooming method for GUI grounding
Dynamic spatial focusing with adaptive context switching
Benchmark for evaluating zoom adaptability in GUIs
Zhiyuan Jiang
Xi’an Jiaotong University
Shenghao Xie
Ph.D. Student, AAIS, PKU
Computer Vision, Machine Learning
Wenyi Li
University of Chinese Academy of Sciences
Wenqiang Zu
University of Chinese Academy of Sciences, Peking University
Peihang Li
The University of Hong Kong
Jiahao Qiu
Princeton University
LLM, AI Agents, AI for X
Siqi Pei
Michigan State University
Lei Ma
Peking University
Tiejun Huang
Professor, School of Computer Science, Peking University
Visual Information Processing
Mengdi Wang
Princeton University
Shilong Liu
RS@ByteDance, PhD@THU
Computer Vision, Object Detection, Visual Grounding, Multi-Modality, Multimodal Agent