🤖 AI Summary
To address three key challenges in GUI agent training (scarce high-quality data, prohibitively high annotation costs, and weak cross-device privacy protection), this paper proposes Step-GUI, a framework built around the Calibrated Step Reward System, a calibration-based step-level reward mechanism, and GUI-MCP, a Model Context Protocol for GUI automation. The reward system converts model-generated trajectories into reliable training signals through trajectory-level calibration, enabling self-evolving annotation with accuracy above 90% at 10-100x lower cost than conventional approaches. GUI-MCP's hierarchical architecture combines low-level atomic operations with high-level task delegation to local specialist models, enabling high-privacy execution in which sensitive data stays on-device. The paper also introduces AndroidDaily, a benchmark grounded in real-world Android usage patterns. Experiments show that Step-GUI 8B achieves state-of-the-art performance: 80.2% on AndroidWorld, 48.5% on OSWorld, and 62.6% on ScreenSpot-Pro, along with 89.91% static accuracy and 52.50% end-to-end accuracy on AndroidDaily.
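To make the calibration idea concrete, here is a minimal sketch of how a trajectory-level outcome could correct noisy step-level rewards. The names (`Step`, `calibrate_trajectory`) and the clamping rule are illustrative assumptions, not the paper's actual method:

```python
from dataclasses import dataclass
from typing import List

# Hypothetical sketch: the paper's Calibrated Step Reward System is not
# specified here; this shows one simple way a trajectory-level signal
# could calibrate per-step rewards.

@dataclass
class Step:
    action: str
    step_reward: float  # raw per-step score from a judge model, in [0, 1]

def calibrate_trajectory(steps: List[Step], task_succeeded: bool,
                         threshold: float = 0.5) -> List[float]:
    """Adjust noisy step-level rewards using the trajectory-level outcome.

    If the trajectory failed overall, cap optimistic step rewards at the
    threshold; if it succeeded, floor pessimistic ones. The result is a
    cleaner training signal than raw step scores alone.
    """
    calibrated = []
    for s in steps:
        r = s.step_reward
        if task_succeeded:
            r = max(r, threshold)   # success: no step scored below threshold
        else:
            r = min(r, threshold)   # failure: no step scored above threshold
        calibrated.append(r)
    return calibrated
```

Under this toy rule, a confident-looking step in a failed trajectory is pulled down toward the threshold, which is the kind of correction a trajectory-level check can provide over purely local judgments.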
📝 Abstract
Recent advances in multimodal large language models unlock unprecedented opportunities for GUI automation. However, a fundamental challenge remains: how to efficiently acquire high-quality training data while maintaining annotation reliability. We introduce a self-evolving training pipeline powered by the Calibrated Step Reward System, which converts model-generated trajectories into reliable training signals through trajectory-level calibration, achieving >90% annotation accuracy at 10-100x lower cost. Leveraging this pipeline, we introduce Step-GUI, a family of models (4B/8B) that achieves state-of-the-art GUI performance (8B: 80.2% AndroidWorld, 48.5% OSWorld, 62.6% ScreenSpot-Pro) while maintaining robust general capabilities. As GUI agent capabilities improve, practical deployment demands standardized interfaces across heterogeneous devices while protecting user privacy. To this end, we propose GUI-MCP, the first Model Context Protocol for GUI automation, with a hierarchical architecture that combines low-level atomic operations with high-level task delegation to local specialist models, enabling high-privacy execution where sensitive data stays on-device. Finally, to assess whether agents can handle authentic everyday usage, we introduce AndroidDaily, a benchmark grounded in real-world mobile usage patterns with 3146 static actions and 235 end-to-end tasks across high-frequency daily scenarios (8B: static 89.91%, end-to-end 52.50%). Our work advances the development of practical GUI agents and demonstrates strong potential for real-world deployment in everyday digital interactions.
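The hierarchical split described above can be sketched as a two-tier tool surface: atomic operations executed directly, and a delegation tool that hands a whole subtask to an on-device specialist so sensitive data never leaves the device. All tool names and fields below are assumptions for illustration, not the actual GUI-MCP schema:

```python
# Hypothetical sketch of a hierarchical GUI-MCP tool surface.
# Tool names ("gui.tap", "gui.delegate_task", ...) are invented here.

LOW_LEVEL_TOOLS = [
    {"name": "gui.tap",   "params": {"x": "int", "y": "int"}},
    {"name": "gui.type",  "params": {"text": "str"}},
    {"name": "gui.swipe", "params": {"x1": "int", "y1": "int",
                                     "x2": "int", "y2": "int"}},
]

HIGH_LEVEL_TOOLS = [
    # Delegates an entire subtask to a local specialist model, so
    # screenshots and credentials stay on-device.
    {"name": "gui.delegate_task", "params": {"instruction": "str"}},
]

def route(tool_call: dict) -> str:
    """Route a tool call to the on-device specialist or the atomic executor."""
    name = tool_call["name"]
    if any(t["name"] == name for t in HIGH_LEVEL_TOOLS):
        return "on-device specialist"   # privacy-sensitive path
    if any(t["name"] == name for t in LOW_LEVEL_TOOLS):
        return "atomic executor"
    raise KeyError(f"unknown tool: {name}")
```

The design point this illustrates is that a cloud planner can orchestrate via high-level delegation while only the local model ever sees raw screen content.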