Learning Active Perception via Self-Evolving Preference Optimization for GUI Grounding

📅 2025-09-04
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
To address precise instruction-grounded region localization in high-resolution, multi-element GUI images, where conventional vision-language models (VLMs) struggle with spatial grounding, this paper proposes LASER, a framework for multi-step active perception and adaptive reasoning. LASER selects high-confidence candidate regions via Monte Carlo quality estimation, guides attention toward task-critical areas with an IoU-based regional evaluation mechanism, and scales the number of reasoning steps to task complexity. By combining preference optimization, multi-step perceptual modeling, and fine-grained coordinate prediction, LASER substantially improves localization accuracy: it yields consistent gains on the ScreenSpot-Pro and ScreenSpot-v2 benchmarks, and when fine-tuned on GTA1-7B it reaches 55.7 on ScreenSpot-Pro, a new state-of-the-art among 7B-scale models, empirically validating the effectiveness of self-evolving perception for GUI understanding.
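
To make the summary concrete, here is a minimal Python sketch of how Monte Carlo quality estimation and IoU scoring could be combined to build preference pairs over candidate regions. This is an illustration under assumptions, not the authors' code: iou follows the standard definition, while mc_quality, build_preference_pair, and model.predict_click are hypothetical names for interfaces the paper does not specify here.

    def iou(box_a, box_b):
        # Intersection-over-Union of two (x1, y1, x2, y2) boxes.
        ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        union = area_a + area_b - inter
        return inter / union if union > 0 else 0.0

    def mc_quality(model, image, instruction, region, n_rollouts=8):
        # Monte Carlo estimate: fraction of sampled (temperature > 0)
        # click predictions from this region that hit the target element.
        # model.predict_click is a hypothetical interface, not the paper's API.
        hits = sum(model.predict_click(image, instruction, crop=region).correct
                   for _ in range(n_rollouts))
        return hits / n_rollouts

    def build_preference_pair(model, image, instruction, gt_box,
                              candidates, min_gap=0.25):
        # Score each candidate region by MC success rate plus IoU with the
        # ground-truth box; keep best/worst as a chosen/rejected pair only
        # when the two are clearly separated.
        scored = sorted(
            (mc_quality(model, image, instruction, r) + iou(r, gt_box), r)
            for r in candidates)
        (lo, rejected), (hi, chosen) = scored[0], scored[-1]
        if hi - lo < min_gap:
            return None  # ambiguous example: skip it
        return {"chosen": chosen, "rejected": rejected}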

📝 Abstract
Vision Language Models (VLMs) have recently achieved significant progress in bridging visual perception and linguistic reasoning. Notably, the OpenAI o3 model introduced a zoom-in search strategy that effectively elicits active perception capabilities in VLMs, improving downstream task performance. However, enabling VLMs to reason effectively over appropriate image regions remains a core challenge in GUI grounding, particularly under high-resolution inputs and complex multi-element visual interactions. In this work, we propose LASER, a self-evolving framework that progressively endows VLMs with multi-step perception capabilities, enabling precise coordinate prediction. Specifically, our approach integrates Monte Carlo quality estimation with Intersection-over-Union (IoU)-based region quality evaluation to jointly encourage both accuracy and diversity when constructing high-quality preference data. This combination explicitly guides the model to focus on instruction-relevant key regions while adaptively allocating reasoning steps based on task complexity. Comprehensive experiments on the ScreenSpot-Pro and ScreenSpot-v2 benchmarks demonstrate consistent performance gains, validating the effectiveness of our method. Furthermore, when fine-tuned on GTA1-7B, LASER achieves a score of 55.7 on the ScreenSpot-Pro benchmark, establishing a new state-of-the-art (SoTA) among 7B-scale models.
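
For context on the preference-optimization step, a common objective for training on chosen/rejected pairs is the DPO loss. The abstract does not state LASER's exact objective, so the PyTorch snippet below is a sketch of the standard DPO form under that assumption, not necessarily the paper's loss.

    import torch.nn.functional as F

    def dpo_loss(policy_logp_chosen, policy_logp_rejected,
                 ref_logp_chosen, ref_logp_rejected, beta=0.1):
        # Standard DPO objective over per-sequence log-probabilities:
        # push the policy to prefer the chosen response relative to a
        # frozen reference model, with strength controlled by beta.
        policy_margin = policy_logp_chosen - policy_logp_rejected
        ref_margin = ref_logp_chosen - ref_logp_rejected
        return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
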
Problem

Research questions and friction points this paper is trying to address.

Enabling VLMs to reason over appropriate image regions
Addressing GUI grounding under high-resolution inputs
Handling complex multi-element visual interactions in VLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-evolving framework with multi-step perception capabilities
Combination of Monte Carlo and IoU-based quality evaluation
Adaptive allocation of reasoning steps based on task complexity (sketched below)
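
A rough sketch of how such adaptive step allocation might look as an active-perception loop; the method names here (propose_region, predict_click, crop) and the confidence threshold are assumptions rather than the paper's API.

    def adaptive_grounding(model, image, instruction,
                           max_steps=4, confidence_threshold=0.8):
        # Multi-step active perception: zoom into the model's best
        # candidate region until it is confident enough, then predict
        # the final click coordinates.
        view = image
        for _ in range(max_steps):
            region, confidence = model.propose_region(view, instruction)
            if confidence >= confidence_threshold:
                break                    # easy case: stop reasoning early
            view = view.crop(region)     # hard case: zoom in and re-reason
        return model.predict_click(view, instruction)
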
Authors

Wanfu Wang, Soochow University
Qipeng Huang, Soochow University
Guangquan Xue, Soochow University
Xiaobo Liang, Soochow University (NLP)
Juntao Li, Soochow University (Language Models, Text Generation)