🤖 AI Summary
Existing GUI automation methods suffer from low cross-platform grounding accuracy for interface elements, particularly on large, cluttered screenshots, where visual GUI agents are easily distracted by irrelevant regions. To address this, the authors propose R-VLM, a region-aware vision-language grounding framework with two key components: (1) zoomed-in region proposals that focus the model on candidate regions for precise element localization, and (2) an IoU-aware objective function that, unlike plain cross-entropy, explicitly drives predictions toward high overlap with ground-truth bounding boxes. Evaluated on the ScreenSpot and AgentStudio benchmarks, R-VLM improves state-of-the-art grounding accuracy by 13% across diverse GUI platforms. It further yields absolute gains of 3.2–9.7% in accuracy on the AITW and Mind2Web GUI navigation benchmarks.
📝 Abstract
Visual agent models for automating human activities on Graphical User Interfaces (GUIs) have emerged as a promising research direction, driven by advances in large Vision Language Models (VLMs). A critical challenge in GUI automation is the precise grounding of interface elements across diverse platforms. Existing vision-only GUI agents directly ground elements from large and cluttered screenshots, requiring them to process substantial irrelevant information that compromises their accuracy. In addition, these approaches typically employ basic cross-entropy loss for learning grounding objectives, which fails to effectively capture grounding quality compared to established object detection metrics like Intersection-over-Union (IoU). To address these issues, we introduce R-VLM, a novel GUI grounding approach that leverages zoomed-in region proposals for precise element localization. We also propose an IoU-aware objective function that facilitates model convergence toward high IoU predictions. Our approach bridges the gap between VLMs and conventional object detection techniques, improving the state-of-the-art grounding accuracy by 13% across diverse GUI platforms on the GUI grounding benchmarks ScreenSpot and AgentStudio. In addition, our R-VLM approach shows 3.2-9.7% absolute accuracy improvements in GUI navigation tasks on the AITW and Mind2Web benchmarks.
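The abstract contrasts a plain cross-entropy grounding objective with an IoU-aware one. The paper's exact loss formulation is not given here, but the underlying Intersection-over-Union metric and the simplest IoU-based loss (1 − IoU) can be sketched as follows; the function names and box format `(x1, y1, x2, y2)` are illustrative assumptions, not the paper's implementation.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Intersection rectangle: clamp to zero width/height when boxes are disjoint.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def iou_loss(pred_box, gt_box):
    """Simplest IoU-based loss: 0 at perfect overlap, 1 when disjoint."""
    return 1.0 - iou(pred_box, gt_box)
```

Unlike token-level cross-entropy over coordinate strings, this loss decreases smoothly as the predicted box overlaps the ground truth more, which is the property the IoU-aware objective exploits.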