🤖 AI Summary
Precise GUI element localization from natural-language instructions remains difficult in high-resolution, complex interfaces, and supervised fine-tuning (SFT) suffers from poor generalization and heavy data dependence. To address both problems, this paper proposes a self-evolving reinforcement fine-tuning framework that combines seed data filtering, dense policy-gradient optimization, and attention-map-driven iterative self-evolution, eliminating the need for human-annotated data augmentation and enabling efficient training with only 3K samples. Built on joint vision-language modeling, the resulting 7B model achieves 47.3% accuracy on ScreenSpot-Pro, outperforming the much larger UI-TARS-72B by 24.2 points and ranking first among same-scale models across three major grounding benchmarks. This work is the first to demonstrate that small models, guided by reinforcement-driven self-evolution, can surpass large models trained via conventional supervised fine-tuning.
📝 Abstract
Graphical User Interface (GUI) agents have made substantial strides in understanding and executing user instructions across diverse platforms. Yet grounding these instructions to precise interface elements remains challenging, especially in complex, high-resolution professional environments. Traditional supervised fine-tuning (SFT) methods often require large volumes of diverse data and generalize poorly. To overcome these limitations, we introduce a reinforcement learning (RL) based framework that incorporates three core strategies: (1) seed data curation to ensure high-quality training samples, (2) a dense policy gradient that provides continuous feedback based on prediction accuracy, and (3) a self-evolutionary reinforcement fine-tuning mechanism that iteratively refines the model using attention maps. With only 3K training samples, our 7B-parameter model achieves state-of-the-art results among similarly sized models on three grounding benchmarks. Notably, it attains 47.3% accuracy on the ScreenSpot-Pro dataset, outperforming much larger models, such as UI-TARS-72B, by a margin of 24.2%. These findings underscore the effectiveness of RL-based approaches for GUI agents, particularly in high-resolution, complex environments.
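To make the second strategy concrete: a "dense" reward gives the policy continuous feedback that scales with how close a predicted click is to the target element, rather than a binary hit/miss signal. The sketch below is a hypothetical illustration of this contrast; the function names and the exact reward shaping are assumptions, not the paper's implementation.

```python
def binary_reward(pred, box):
    """Sparse signal: 1.0 if the predicted (x, y) point lands inside
    the target element's bounding box (x1, y1, x2, y2), else 0.0."""
    x, y = pred
    x1, y1, x2, y2 = box
    return 1.0 if x1 <= x <= x2 and y1 <= y <= y2 else 0.0

def dense_reward(pred, box, screen_diag):
    """Dense signal: reward in [0, 1] that decays linearly with the
    distance from the prediction to the box center, normalized by the
    screen diagonal (illustrative shaping, not the paper's formula)."""
    x, y = pred
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    dist = ((x - cx) ** 2 + (y - cy) ** 2) ** 0.5
    return max(0.0, 1.0 - dist / screen_diag)
```

Under a sparse reward, a near-miss on a high-resolution screen earns the same zero as a wildly wrong click; the dense variant still rewards near-misses, which gives the policy gradient a usable learning signal early in training.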