Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding

📅 2026-04-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Accurately grounding natural language instructions to pixel coordinates remains highly challenging in graphical user interfaces (GUIs) that are visually homogeneous and densely packed. This work proposes a co-evolutionary “Propose-then-Critic” framework that replaces static consistency strategies with a learnable visual critic, which evaluates candidate click points rendered on the GUI screenshot. The approach uses maturity-aware adaptive co-evolutionary reinforcement learning to dynamically balance the training objectives of proposer and critic and to encourage spatial exploration, so that the two capabilities reinforce each other and generalization improves. Across six benchmark datasets, the method is reported to improve both GUI grounding accuracy and critic reliability.

📝 Abstract

Graphical User Interface (GUI) grounding requires mapping natural language instructions to precise pixel coordinates. However, due to visually homogeneous elements and dense layouts, models typically grasp semantic intent yet struggle to localize precisely. While scaling sampling attempts (Pass@k) reveals potential gains, static self-consistency strategies derived from geometric clustering often yield limited improvements, as the model's predictions tend to be spatially dispersed. In this paper, we propose replacing static consistency strategies with a learnable selection mechanism in which the model selects the optimal target by critiquing its own proposals rendered on the screenshot. Given the significant disparity between the model's grounding and critiquing capabilities, we propose a co-evolving Propose-then-Critic framework. To jointly optimize the two capabilities, we introduce a maturity-aware adaptive co-evolutionary reinforcement learning paradigm. This approach dynamically balances the training objectives of the proposer and the critic: the diversity of the proposer's outputs enhances critic robustness, while the critic's maturing discrimination capability in turn unlocks the proposer's potential for extensive spatial exploration. This mutual reinforcement lets both capabilities co-evolve and supports generalization to diverse and complex interface layouts. Extensive experiments on six benchmarks show that our method significantly enhances both grounding accuracy and critic reliability.
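The gap the abstract describes between Pass@k and static self-consistency can be illustrated with a toy sketch. It is not the paper's implementation. The function names (`pass_at_k`, `cluster_vote`, `select_with_critic`), the radius-based voting rule, and the oracle-style `toy_critic` are all assumptions made for illustration. In the paper, the critic is a learned visual model that judges candidates rendered on the screenshot.

```python
import math
from typing import Callable, List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


def inside(p: Point, box: Box) -> bool:
    """True if the click point falls inside the target bounding box."""
    x, y = p
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def pass_at_k(cands: List[Point], box: Box) -> bool:
    """Pass@k: at least one of the k sampled points hits the target."""
    return any(inside(p, box) for p in cands)


def cluster_vote(cands: List[Point], radius: float) -> Point:
    """Static self-consistency (assumed variant): pick the candidate
    with the most other candidates within `radius` pixels."""
    def support(p: Point) -> int:
        return sum(1 for q in cands if math.dist(p, q) <= radius)
    return max(cands, key=support)


def select_with_critic(cands: List[Point],
                       score: Callable[[Point], float]) -> Point:
    """Learnable selection: pick the candidate the critic scores highest.
    `score` is a placeholder for a learned visual critic."""
    return max(cands, key=score)


# Hypothetical, spatially dispersed samples; only one hits the target.
candidates = [(100, 100), (104, 98), (300, 200), (500, 400)]
target = (290, 190, 310, 210)

# Oracle stand-in for a trained critic, used only to make the demo runnable.
toy_critic = lambda p: 1.0 if inside(p, target) else 0.0

voted = cluster_vote(candidates, radius=20)
chosen = select_with_critic(candidates, toy_critic)
print(pass_at_k(candidates, target), inside(voted, target), inside(chosen, target))
```

In this toy setup a correct candidate exists among the samples, but the densest cluster lies elsewhere, so geometric voting picks a wrong point while a sufficiently accurate critic recovers the right one. How accurate the critic must be, and how it is trained alongside the proposer, is what the paper's co-evolutionary reinforcement learning addresses.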

Problem

Research questions and friction points this paper is trying to address.

GUI grounding

precise localization

visually homogeneous elements

dense layouts

natural language instructions

Innovation

Methods, ideas, or system contributions that make the work stand out.

GUI grounding

co-evolutionary reinforcement learning

visual critic