🤖 AI Summary
Precise GUI element localization from natural-language instructions remains difficult in high-resolution, complex interfaces, and supervised fine-tuning (SFT) suffers from poor generalization and heavy data dependence. To address both problems, this paper proposes a self-evolving reinforcement fine-tuning framework that combines seed data filtering, dense policy-gradient optimization, and attention-map-driven iterative self-evolution, eliminating the need for human-annotated data augmentation and enabling efficient training with only 3K samples. Built on joint vision-language modeling, the resulting 7B model achieves 47.3% accuracy on ScreenSpot-Pro, outperforming the much larger UI-TARS-72B by 24.2 points and ranking first among same-scale models across three major grounding benchmarks. This work is the first to demonstrate that small models, guided by reinforcement-driven self-evolution, can surpass large models trained via conventional supervised fine-tuning.
📝 Abstract
Graphical User Interface (GUI) agents have made substantial strides in understanding and executing user instructions across diverse platforms. Yet grounding these instructions to precise interface elements remains challenging, especially in complex, high-resolution professional environments. Traditional supervised fine-tuning (SFT) methods often require large volumes of diverse data and generalize poorly. To overcome these limitations, we introduce a reinforcement learning (RL) based framework that incorporates three core strategies: (1) seed data curation to ensure high-quality training samples, (2) a dense policy gradient that provides continuous feedback based on prediction accuracy, and (3) a self-evolutionary reinforcement fine-tuning mechanism that iteratively refines the model using attention maps. With only 3K training samples, our 7B-parameter model achieves state-of-the-art results among similarly sized models on three grounding benchmarks. Notably, it attains 47.3% accuracy on the ScreenSpot-Pro dataset, outperforming much larger models, such as UI-TARS-72B, by a margin of 24.2%. These findings underscore the effectiveness of RL-based approaches for GUI agents, particularly in high-resolution, complex environments.
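To make the second strategy concrete: a "dense" reward gives the policy continuous feedback that scales with how close a predicted click is to the target element, rather than a binary hit/miss signal. The sketch below is a hypothetical illustration of this contrast; the function names and the exact reward shaping are assumptions, not the paper's implementation.

```python
def binary_reward(pred, box):
    """Sparse signal: 1.0 if the predicted (x, y) point lands inside
    the target element's bounding box (x1, y1, x2, y2), else 0.0."""
    x, y = pred
    x1, y1, x2, y2 = box
    return 1.0 if x1 <= x <= x2 and y1 <= y <= y2 else 0.0

def dense_reward(pred, box, screen_diag):
    """Dense signal: reward in [0, 1] that decays linearly with the
    distance from the prediction to the box center, normalized by the
    screen diagonal (illustrative shaping, not the paper's formula)."""
    x, y = pred
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    dist = ((x - cx) ** 2 + (y - cy) ** 2) ** 0.5
    return max(0.0, 1.0 - dist / screen_diag)
```

Under a sparse reward, a near-miss on a high-resolution screen earns the same zero as a wildly wrong click; the dense variant still rewards near-misses, which gives the policy gradient a usable learning signal early in training.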