AI Summary
To address the challenge of precise coordinate prediction by vision-language models (VLMs) on high-resolution GUI images with complex layouts, this work proposes an interactive cursor-search paradigm: GUI localization is reformulated as a multi-step cursor-movement task guided by visual feedback, replacing conventional end-to-end coordinate regression. Methodologically, the model builds on Qwen2.5-VL-7B and combines spatial-relation modeling with a trajectory-aware dense-reward reinforcement learning framework for online multi-step policy optimization. The key contribution is a visual-feedback-driven interactive reasoning mechanism that significantly improves the model's understanding of on-screen spatial structure and its localization robustness. Experiments demonstrate state-of-the-art performance: 93.9% accuracy (+5.1%) on ScreenSpot-v2 and 56.5% (+29.7%) on ScreenSpot-Pro; moreover, 95% of tasks are completed within two steps.
Abstract
Graphical User Interface (GUI) grounding is commonly framed as a coordinate prediction task -- given a natural language instruction, generate on-screen coordinates for actions such as clicks and keystrokes. However, recent Vision Language Models (VLMs) often fail to predict accurate numeric coordinates when processing high-resolution GUI images with complex layouts. To address this issue, we reframe GUI grounding as an \emph{interactive search task}, where the VLM generates actions to move a cursor in the GUI to locate UI elements. At each step, the model determines the target object, evaluates the spatial relations between the cursor and the target, and moves the cursor closer to the target conditioned on the movement history. In this interactive process, the rendered cursor provides visual feedback to help the model align its predictions with the corresponding on-screen locations. We train our GUI grounding model, GUI-Cursor, using multi-step online reinforcement learning with a dense trajectory-based reward function. Our experimental results show that GUI-Cursor, based on Qwen2.5-VL-7B, improves GUI grounding accuracy and achieves state-of-the-art results on ScreenSpot-v2 ($88.8\% \rightarrow 93.9\%$) and ScreenSpot-Pro ($26.8\% \rightarrow 56.5\%$). Moreover, we observe that GUI-Cursor learns to solve the problem within two steps for 95\% of instances and can adaptively conduct more steps on more difficult examples.
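The interactive search loop described above -- repeatedly inspecting the cursor's position relative to the target, moving closer, and collecting a dense, distance-based reward at each step -- can be illustrated with a toy sketch. Everything here (the grid environment, the greedy "policy", the Manhattan-distance reward) is a simplified stand-in for exposition only, not GUI-Cursor's actual VLM policy or reward implementation.

```python
def dense_reward(prev_dist: float, new_dist: float) -> float:
    """Dense trajectory reward: positive whenever the cursor moves closer."""
    return prev_dist - new_dist


def cursor_search(target, start=(0, 0), max_steps=10):
    """Toy interactive cursor search on a pixel grid.

    A greedy policy (stand-in for the VLM conditioned on visual feedback
    and movement history) steps the cursor one unit at a time toward
    `target`, accumulating a dense reward for each distance reduction.
    """
    def dist(p):
        # Manhattan distance as an illustrative distance measure.
        return abs(p[0] - target[0]) + abs(p[1] - target[1])

    cursor, total_reward, trajectory = start, 0.0, [start]
    for _ in range(max_steps):
        if dist(cursor) == 0:  # target located: stop early
            break
        # Greedy move: step one unit along the axis with the larger error.
        dx, dy = target[0] - cursor[0], target[1] - cursor[1]
        if abs(dx) >= abs(dy):
            step = (1 if dx > 0 else -1, 0)
        else:
            step = (0, 1 if dy > 0 else -1)
        new_cursor = (cursor[0] + step[0], cursor[1] + step[1])
        total_reward += dense_reward(dist(cursor), dist(new_cursor))
        cursor = new_cursor
        trajectory.append(cursor)
    return cursor, total_reward, trajectory
```

In GUI-Cursor the policy is a VLM acting on rendered screenshots and the reward is computed over whole trajectories during online RL; the sketch only conveys the loop structure: observe, compare cursor to target, move, receive dense feedback.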