🤖 AI Summary
Existing vision-only GUI agents are trained offline, which creates two major bottlenecks: high human annotation costs and poor adaptability to dynamic, interactive environments. To address these, we propose ZeroGUI, a VLM-driven online learning framework that requires no human annotation. The method uses large vision-language models for environment perception, automatic task generation, and automatic reward estimation, and it introduces a two-stage online reinforcement learning mechanism that optimizes the policy continuously through interaction with the environment. The framework therefore removes the dependence on both manual annotations and handcrafted evaluation functions. Evaluated on the OSWorld and AndroidLab benchmarks, it significantly improves the generalization capability and task completion rates of UI-TARS and Aguvis. To our knowledge, this is the first work to achieve fully autonomous, online evolution of GUI agents in real-world, dynamically changing interfaces.
📝 Abstract
The rapid advancement of large Vision-Language Models (VLMs) has propelled the development of pure-vision-based GUI Agents, capable of perceiving and operating Graphical User Interfaces (GUIs) to autonomously fulfill user instructions. However, existing approaches usually adopt an offline learning framework, which faces two core limitations: (1) heavy reliance on high-quality manual annotations for element grounding and action supervision, and (2) limited adaptability to dynamic and interactive environments. To address these limitations, we propose ZeroGUI, a scalable, online learning framework for automating GUI Agent training at Zero human cost. Specifically, ZeroGUI integrates (i) VLM-based automatic task generation to produce diverse training goals from the current environment state, (ii) VLM-based automatic reward estimation to assess task success without hand-crafted evaluation functions, and (iii) two-stage online reinforcement learning to continuously interact with and learn from GUI environments. Experiments on two advanced GUI Agents (UI-TARS and Aguvis) demonstrate that ZeroGUI significantly boosts performance in both the OSWorld and AndroidLab environments. The code is available at https://github.com/OpenGVLab/ZeroGUI.
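To make the three components above more concrete, here is a minimal Python sketch of one round of online data collection in which both the tasks and the rewards come from a model rather than from human labels. It is an illustration only and is not taken from the ZeroGUI repository: `Trajectory`, `collect_rollouts`, and every `demo_*` function are hypothetical names, the VLM calls are replaced by trivial stand-ins, and the reinforcement learning update that would consume these trajectories is left out because the abstract does not describe it.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Trajectory:
    """One attempt at one generated task (hypothetical container)."""
    task: str
    actions: List[str] = field(default_factory=list)
    reward: float = 0.0


def collect_rollouts(
    observe: Callable[[], str],
    propose_tasks: Callable[[str, int], List[str]],
    act: Callable[[str, str], str],
    step: Callable[[str, str], str],
    judge: Callable[[str, str, List[str]], float],
    n_tasks: int = 4,
    max_steps: int = 3,
) -> List[Trajectory]:
    """Run one collection round with no human-written tasks or rewards."""
    initial_state = observe()
    trajectories = []
    # (i) Tasks are proposed from the current environment state.
    for task in propose_tasks(initial_state, n_tasks):
        state = initial_state
        traj = Trajectory(task=task)
        # (iii) The agent interacts with the environment step by step.
        for _ in range(max_steps):
            action = act(state, task)
            traj.actions.append(action)
            state = step(state, action)
        # (ii) Success is estimated by a judge, not a hand-crafted checker.
        traj.reward = judge(task, state, traj.actions)
        trajectories.append(traj)
    return trajectories


# Trivial stand-ins for the VLM, the agent, and the GUI environment.
# All of these are assumptions made purely so the sketch runs.
def demo_observe() -> str:
    return "desktop"


def demo_propose_tasks(state: str, n: int) -> List[str]:
    return [f"task {i} on {state}" for i in range(n)]


def demo_act(state: str, task: str) -> str:
    return "click"


def demo_step(state: str, action: str) -> str:
    return f"{state}>{action}"


def demo_judge(task: str, final_state: str, actions: List[str]) -> float:
    return 1.0 if "click" in final_state else 0.0


if __name__ == "__main__":
    for t in collect_rollouts(demo_observe, demo_propose_tasks,
                              demo_act, demo_step, demo_judge):
        print(t.task, t.actions, t.reward)
```

Passing the task proposer and the judge in as plain callables keeps the loop independent of any particular VLM, so the stand-ins could be swapped for real model calls without changing `collect_rollouts`.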