🤖 AI Summary
The core bottleneck in GUI grounding lies in reliable patch-to-pixel mapping, especially when generalizing to unseen high-resolution screens. Existing approaches model spatial coordinates as text tokens, forcing models to learn complex position-to-pixel relationships implicitly, which yields poor generalization and degraded localization accuracy. To address this, we propose RULER tokens, explicit and learnable coordinate representations, coupled with Interleaved MRoPE (I-MRoPE), a spatial encoding mechanism that produces width- and height-symmetric, decoupled pixel-level positional embeddings. By aligning RULER tokens with visual features, the model gains explicit spatial guidance instead of relying on implicit coordinate mapping. Evaluated on the ScreenSpot benchmark suite, our method significantly improves cross-resolution localization accuracy, with the largest gains on high-resolution interfaces, and strengthens robustness. This advances vision-language grounding for GUI automation across heterogeneous devices.
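To make the symmetric encoding concrete, here is a minimal sketch of the interleaving idea in PyTorch. It assumes I-MRoPE follows the pattern its name suggests: instead of standard MRoPE's block split, which hands one contiguous band of rotary frequency channels to the height axis and a separate band to the width axis, the channels alternate between axes so both see the full frequency spectrum. The function names and the even/odd assignment are illustrative, not the authors' implementation.

```python
import torch

def rope_inv_freq(n_pairs: int, base: float = 10000.0) -> torch.Tensor:
    """Inverse frequencies for rotary embeddings, one per channel pair."""
    return 1.0 / (base ** (torch.arange(n_pairs, dtype=torch.float32) / n_pairs))

def interleaved_mrope_angles(h_pos: torch.Tensor,
                             w_pos: torch.Tensor,
                             n_pairs: int = 64) -> torch.Tensor:
    """Rotation angles of shape (num_tokens, n_pairs).

    Standard MRoPE assigns contiguous blocks of channel pairs per axis,
    so height sees only one frequency band and width another. Here the
    pairs alternate h, w, h, w, ..., so both axes are encoded with the
    same spread of frequencies.
    """
    inv_freq = rope_inv_freq(n_pairs)              # (n_pairs,)
    use_w = (torch.arange(n_pairs) % 2).bool()     # even pairs -> height, odd -> width
    pos = torch.where(use_w, w_pos[:, None].float(), h_pos[:, None].float())
    return pos * inv_freq                          # feed to cos/sin as in standard RoPE

# Example: positions of a 2x3 grid of image patches, flattened row-major.
h_pos = torch.tensor([0, 0, 0, 1, 1, 1])
w_pos = torch.tensor([0, 1, 2, 0, 1, 2])
angles = interleaved_mrope_angles(h_pos, w_pos)    # (6, 64)
```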
📝 Abstract
GUI grounding, the task of mapping natural-language instructions to pixel coordinates, is crucial for autonomous agents, yet it remains difficult for current VLMs. The core bottleneck is reliable patch-to-pixel mapping, which breaks down when extrapolating to high-resolution displays unseen during training. Current approaches generate coordinates as text tokens directly from visual features, forcing the model to infer complex position-to-pixel mappings implicitly; as a result, accuracy degrades and failures proliferate at new resolutions. We address this with two complementary innovations. First, RULER tokens serve as explicit coordinate markers, letting the model reference positions the way gridlines on a map do and adjust coordinates rather than generate them from scratch. Second, Interleaved MRoPE (I-MRoPE) improves spatial encoding by ensuring that the width and height dimensions are represented equally, addressing the asymmetry of standard positional encoding schemes. Experiments on ScreenSpot, ScreenSpot-V2, and ScreenSpot-Pro show consistent gains in grounding accuracy, with the largest improvements on high-resolution interfaces. By providing explicit spatial guidance rather than relying on implicit learning, our approach enables more reliable GUI automation across diverse resolutions and platforms.
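The abstract describes RULER tokens only at the level of the gridline analogy, so the following is a minimal sketch of what such explicit coordinate markers could look like, assuming one learnable embedding per fixed-stride tick along each axis. The class name, the 100 px stride, and the coord_to_mark helper are hypothetical, chosen for illustration rather than taken from the paper.

```python
import torch.nn as nn

STRIDE, MAX_RES = 100, 4000          # assumed tick spacing and max supported resolution
N_MARKS = MAX_RES // STRIDE          # 40 learnable ticks per axis

class RulerTokens(nn.Module):
    """Learnable coordinate markers, one embedding per ruler tick.

    Instead of emitting raw pixel values as text, the model can attend
    to the tick nearest a target and refine it with a small offset,
    much as gridlines on a map anchor a position.
    """
    def __init__(self, d_model: int = 1024):
        super().__init__()
        self.x_marks = nn.Embedding(N_MARKS, d_model)  # vertical gridlines
        self.y_marks = nn.Embedding(N_MARKS, d_model)  # horizontal gridlines

def coord_to_mark(pixel: int) -> tuple[int, int]:
    """Decompose a pixel coordinate into (tick index, residual offset)."""
    return pixel // STRIDE, pixel % STRIDE

tick, offset = coord_to_mark(1337)   # (13, 37): 37 px past the 1300 px tick
```

Under this assumption, the appeal of the decomposition is that the tick index is a bounded target shared across resolutions, leaving only a small residual to resolve, which is consistent with the paper's claim that referencing markers and adjusting generalizes better than generating absolute coordinates from scratch.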