ReGUIDE: Data Efficient GUI Grounding via Spatial Reasoning and Search

📅 2025-05-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) for GUI interaction typically require extensive annotated data to localize interface elements precisely. Method: We propose a highly data-efficient coordinate localization framework built on an online reinforcement learning paradigm that integrates self-generated multi-step linguistic reasoning with spatially aware criticism. We incorporate spatial equivariance priors for geometric consistency and design a test-time spatial search coupled with coordinate aggregation to enhance generalization. Contribution/Results: Our approach outperforms the best open-source baselines while using only 0.2% of their training samples, significantly improving both accuracy and data efficiency across multiple GUI localization benchmarks. This work establishes a scalable, low-annotation paradigm for MLLM-driven fine-grained web interaction, advancing efficient multimodal grounding in interactive interfaces.
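The spatially aware criticism described above can be illustrated with a minimal sketch. This is not ReGUIDE's actual implementation: `predict` is a hypothetical stand-in for the MLLM's coordinate prediction, the crop transform and tolerance are assumptions, and the binary reward is one plausible way to enforce equivariance under an input crop.

```python
def crop_transform(box, point):
    """Map a full-image point into crop coordinates, given crop box (x0, y0, x1, y1)."""
    x0, y0, _, _ = box
    return (point[0] - x0, point[1] - y0)

def inverse_crop_transform(box, point):
    """Map a crop-space point back into full-image coordinates."""
    x0, y0, _, _ = box
    return (point[0] + x0, point[1] + y0)

def equivariance_reward(predict, image, crop_box, tol=5.0):
    """Reward the model when its prediction on the full image and its
    prediction on a crop agree after mapping back to one coordinate frame.
    `predict(image, crop=...)` is a hypothetical model interface."""
    p_full = predict(image, crop=None)
    p_crop = predict(image, crop=crop_box)
    p_back = inverse_crop_transform(crop_box, p_crop)
    dist = ((p_full[0] - p_back[0]) ** 2 + (p_full[1] - p_back[1]) ** 2) ** 0.5
    return 1.0 if dist <= tol else 0.0
```

A perfectly equivariant predictor earns the full reward; a predictor whose answer drifts when the input is cropped is penalized, which is the geometric-consistency signal the summary refers to.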

📝 Abstract
Recent advances in Multimodal Large Language Models (MLLMs) have enabled autonomous agents to interact with computers via Graphical User Interfaces (GUIs), where accurately localizing the coordinates of interface elements (e.g., buttons) is often required for fine-grained actions. However, this remains significantly challenging, leading prior works to rely on large-scale web datasets to improve the grounding accuracy. In this work, we propose Reasoning Graphical User Interface Grounding for Data Efficiency (ReGUIDE), a novel and effective framework for web grounding that enables MLLMs to learn data efficiently through self-generated reasoning and spatial-aware criticism. More specifically, ReGUIDE learns to (i) self-generate a language reasoning process for the localization via online reinforcement learning, and (ii) criticize the prediction using spatial priors that enforce equivariance under input transformations. At inference time, ReGUIDE further boosts performance through a test-time scaling strategy, which combines spatial search with coordinate aggregation. Our experiments demonstrate that ReGUIDE significantly advances web grounding performance across multiple benchmarks, outperforming baselines with substantially fewer training data points (e.g., only 0.2% samples compared to the best open-sourced baselines).
Problem

Research questions and friction points this paper is trying to address.

Accurate GUI element localization for fine-grained actions
Data-efficient learning via self-generated reasoning and criticism
Improving web grounding performance with minimal training data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-generates language reasoning for localization
Uses spatial priors for prediction criticism
Combines spatial search with coordinate aggregation
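The last point, test-time spatial search with coordinate aggregation, can be sketched as follows. This is an assumed reading of the abstract, not the paper's exact procedure: `predict` is again a hypothetical model interface, the crops stand in for the spatial search, and coordinate-wise median is one reasonable aggregation choice.

```python
import statistics

def aggregate_predictions(predict, image, crop_boxes):
    """Predict a target coordinate on several spatial crops, map each
    prediction back to full-image coordinates, and aggregate by
    coordinate-wise median to suppress outlier predictions."""
    points = []
    for box in crop_boxes:
        p = predict(image, crop=box)          # prediction in crop coordinates
        x0, y0 = box[0], box[1]
        points.append((p[0] + x0, p[1] + y0))  # back to full-image frame
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (statistics.median(xs), statistics.median(ys))
```

Because the median ignores a single wildly wrong crop-level prediction, aggregation across the searched crops can be more robust than any one forward pass, which is the intuition behind the test-time scaling claim.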