🤖 AI Summary
Multimodal large language models (MLLMs) suffer from unreliable visual grounding in GUI tasks, hindering the precise localization of interface elements required for pointer-level interaction. To address this, we propose an adaptive iterative focusing framework that dynamically orchestrates specialized tools—including region cropping, zooming, and OCR—to progressively refine the model’s attention region, enabling fine-grained visual localization. Our method jointly leverages image-grounded reasoning and multimodal language understanding. Evaluated on the ScreenSpot-Pro benchmark, it achieves 52.8% accuracy using only 18.5K annotated samples—significantly outperforming prior approaches that rely on million-scale supervision. The core contribution is a lightweight, interpretable iterative refinement mechanism that removes the dependence on massive labeled data, thereby enhancing both the robustness and practical applicability of MLLMs in real-world GUI environments.
📝 Abstract
Multimodal large language models (MLLMs) have markedly expanded the competence of graphical user-interface (GUI) systems, propelling them beyond controlled simulations into complex, real-world environments across diverse platforms. However, practical usefulness is still bounded by the reliability of visual grounding, i.e., mapping textual references to exact on-screen elements. This limitation prevents the system from accurately performing pointer-level actions such as clicking or dragging. To address it, we introduce GUI-Spotlight -- a model trained for image-grounded reasoning that dynamically invokes multiple specialized tools to iteratively narrow its focus to the relevant region of the screen, thereby substantially improving visual grounding accuracy. On the ScreenSpot-Pro benchmark, GUI-Spotlight trained with only 18.5K training samples achieves 52.8% accuracy, surpassing V2P-7B (50.6% with 9.6M training samples) and GTA-1-7B (50.1% with 1.56M training samples).
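The iterative narrowing the abstract describes can be pictured as a loop in which the model repeatedly selects a sub-region of the current view until it is confident enough to emit a pointer-level coordinate. The sketch below is illustrative only: the function names (`iterative_focus`, `locate`), the stopping criterion, and the region representation are assumptions for exposition, not GUI-Spotlight's actual implementation or tool interface.

```python
# Hypothetical sketch of an iterative focusing loop, not the paper's code.
# A region is an (x0, y0, x1, y1) pixel box; `locate` stands in for the
# MLLM-plus-tools step (crop / zoom / OCR) that proposes a sub-region.

def iterative_focus(region, locate, max_steps=5, min_size=32):
    """Repeatedly narrow `region` via `locate` until it is small enough,
    then return the center as the pointer-level click coordinate."""
    x0, y0, x1, y1 = region
    for _ in range(max_steps):
        # The model inspects the current view and predicts a tighter box.
        x0, y0, x1, y1 = locate((x0, y0, x1, y1))
        if (x1 - x0) <= min_size and (y1 - y0) <= min_size:
            break  # region is fine-grained enough to click
    return ((x0 + x1) // 2, (y0 + y1) // 2)


def demo_locate(region, target=(100, 200)):
    """Toy locator for demonstration: halves the box toward a fixed
    ground-truth element at `target`, clamped inside the current view."""
    x0, y0, x1, y1 = region
    w, h = (x1 - x0) // 2, (y1 - y0) // 2
    nx0 = max(x0, target[0] - w // 2)
    ny0 = max(y0, target[1] - h // 2)
    return (nx0, ny0, nx0 + w, ny0 + h)
```

With the toy locator, `iterative_focus((0, 0, 1024, 1024), demo_locate)` converges on the target element's center in five halving steps; in the real system each step would instead invoke cropping, zooming, or OCR tools and re-query the model on the magnified view.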