🤖 AI Summary
Multimodal large language models (MLLMs) suffer from unreliable visual grounding in GUI tasks, hindering the precise localization of interface elements required for pointer-level interaction. To address this, we propose an adaptive iterative focusing framework that dynamically orchestrates specialized tools—including region cropping, zooming, and OCR—to progressively refine the model’s attention region, enabling fine-grained visual localization. Our method jointly leverages image-grounded reasoning and multimodal language understanding. Evaluated on the ScreenSpot-Pro benchmark, it achieves 52.8% accuracy using only 18.5K annotated samples—significantly outperforming prior approaches that rely on million-scale supervision. The core contribution is a lightweight, interpretable iterative refinement mechanism that removes the dependence on massive labeled data, thereby enhancing both the robustness and practical applicability of MLLMs in real-world GUI environments.
📝 Abstract
Multimodal large language models (MLLMs) have markedly expanded the competence of graphical user-interface (GUI) systems, propelling them beyond controlled simulations into complex, real-world environments across diverse platforms. However, practical usefulness is still bounded by the reliability of visual grounding, i.e., mapping textual references to exact on-screen elements. This limitation prevents the system from accurately performing pointer-level actions such as clicking or dragging. To address it, we introduce GUI-Spotlight -- a model trained for image-grounded reasoning that dynamically invokes multiple specialized tools to iteratively narrow its focus to the relevant region of the screen, thereby substantially improving visual grounding accuracy. On the ScreenSpot-Pro benchmark, GUI-Spotlight trained with only 18.5K training samples achieves 52.8% accuracy, surpassing V2P-7B (50.6% with 9.6M training samples) and GTA-1-7B (50.1% with 1.56M training samples).
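The iterative narrowing the abstract describes can be pictured as a loop in which the model repeatedly selects a sub-region of the current view until it is confident enough to emit a pointer-level coordinate. The sketch below is illustrative only: the function names (`iterative_focus`, `locate`), the stopping criterion, and the region representation are assumptions for exposition, not GUI-Spotlight's actual implementation or tool interface.

```python
# Hypothetical sketch of an iterative focusing loop, not the paper's code.
# A region is an (x0, y0, x1, y1) pixel box; `locate` stands in for the
# MLLM-plus-tools step (crop / zoom / OCR) that proposes a sub-region.

def iterative_focus(region, locate, max_steps=5, min_size=32):
    """Repeatedly narrow `region` via `locate` until it is small enough,
    then return the center as the pointer-level click coordinate."""
    x0, y0, x1, y1 = region
    for _ in range(max_steps):
        # The model inspects the current view and predicts a tighter box.
        x0, y0, x1, y1 = locate((x0, y0, x1, y1))
        if (x1 - x0) <= min_size and (y1 - y0) <= min_size:
            break  # region is fine-grained enough to click
    return ((x0 + x1) // 2, (y0 + y1) // 2)


def demo_locate(region, target=(100, 200)):
    """Toy locator for demonstration: halves the box toward a fixed
    ground-truth element at `target`, clamped inside the current view."""
    x0, y0, x1, y1 = region
    w, h = (x1 - x0) // 2, (y1 - y0) // 2
    nx0 = max(x0, target[0] - w // 2)
    ny0 = max(y0, target[1] - h // 2)
    return (nx0, ny0, nx0 + w, ny0 + h)
```

With the toy locator, `iterative_focus((0, 0, 1024, 1024), demo_locate)` converges on the target element's center in five halving steps; in the real system each step would instead invoke cropping, zooming, or OCR tools and re-query the model on the magnified view.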