🤖 AI Summary
To address three key challenges in GUI agent training (scarce high-quality data, prohibitively high annotation costs, and weak cross-device privacy protection), this paper proposes Step-GUI, a framework built around the Calibrated Step Reward System, a calibration-based step-level reward mechanism, and GUI-MCP, a Model Context Protocol for GUI automation. The reward system converts model-generated trajectories into reliable training signals through trajectory-level calibration, enabling self-evolving annotation with accuracy above 90% at 10-100x lower cost than conventional approaches. GUI-MCP's hierarchical architecture combines low-level atomic operations with high-level task delegation to local specialist models, enabling high-privacy execution in which sensitive data stays on-device. The paper also introduces AndroidDaily, a benchmark grounded in real-world Android usage patterns. Experiments show that Step-GUI 8B achieves state-of-the-art performance: 80.2% on AndroidWorld, 48.5% on OSWorld, and 62.6% on ScreenSpot-Pro, along with 89.91% static accuracy and 52.50% end-to-end accuracy on AndroidDaily.
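To make the calibration idea concrete, here is a minimal sketch of how a trajectory-level outcome could correct noisy step-level rewards. The names (`Step`, `calibrate_trajectory`) and the clamping rule are illustrative assumptions, not the paper's actual method:

```python
from dataclasses import dataclass
from typing import List

# Hypothetical sketch: the paper's Calibrated Step Reward System is not
# specified here; this shows one simple way a trajectory-level signal
# could calibrate per-step rewards.

@dataclass
class Step:
    action: str
    step_reward: float  # raw per-step score from a judge model, in [0, 1]

def calibrate_trajectory(steps: List[Step], task_succeeded: bool,
                         threshold: float = 0.5) -> List[float]:
    """Adjust noisy step-level rewards using the trajectory-level outcome.

    If the trajectory failed overall, cap optimistic step rewards at the
    threshold; if it succeeded, floor pessimistic ones. The result is a
    cleaner training signal than raw step scores alone.
    """
    calibrated = []
    for s in steps:
        r = s.step_reward
        if task_succeeded:
            r = max(r, threshold)   # success: no step scored below threshold
        else:
            r = min(r, threshold)   # failure: no step scored above threshold
        calibrated.append(r)
    return calibrated
```

Under this toy rule, a confident-looking step in a failed trajectory is pulled down toward the threshold, which is the kind of correction a trajectory-level check can provide over purely local judgments.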
📝 Abstract
Recent advances in multimodal large language models unlock unprecedented opportunities for GUI automation. However, a fundamental challenge remains: how to efficiently acquire high-quality training data while maintaining annotation reliability. We introduce a self-evolving training pipeline powered by the Calibrated Step Reward System, which converts model-generated trajectories into reliable training signals through trajectory-level calibration, achieving >90% annotation accuracy at 10-100x lower cost. Leveraging this pipeline, we introduce Step-GUI, a family of models (4B/8B) that achieves state-of-the-art GUI performance (8B: 80.2% AndroidWorld, 48.5% OSWorld, 62.6% ScreenSpot-Pro) while maintaining robust general capabilities. As GUI agent capabilities improve, practical deployment demands standardized interfaces across heterogeneous devices while protecting user privacy. To this end, we propose GUI-MCP, the first Model Context Protocol for GUI automation, with a hierarchical architecture that combines low-level atomic operations with high-level task delegation to local specialist models, enabling high-privacy execution where sensitive data stays on-device. Finally, to assess whether agents can handle authentic everyday usage, we introduce AndroidDaily, a benchmark grounded in real-world mobile usage patterns with 3146 static actions and 235 end-to-end tasks across high-frequency daily scenarios (8B: static 89.91%, end-to-end 52.50%). Our work advances the development of practical GUI agents and demonstrates strong potential for real-world deployment in everyday digital interactions.
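The hierarchical split described above can be sketched as a two-tier tool surface: atomic operations executed directly, and a delegation tool that hands a whole subtask to an on-device specialist so sensitive data never leaves the device. All tool names and fields below are assumptions for illustration, not the actual GUI-MCP schema:

```python
# Hypothetical sketch of a hierarchical GUI-MCP tool surface.
# Tool names ("gui.tap", "gui.delegate_task", ...) are invented here.

LOW_LEVEL_TOOLS = [
    {"name": "gui.tap",   "params": {"x": "int", "y": "int"}},
    {"name": "gui.type",  "params": {"text": "str"}},
    {"name": "gui.swipe", "params": {"x1": "int", "y1": "int",
                                     "x2": "int", "y2": "int"}},
]

HIGH_LEVEL_TOOLS = [
    # Delegates an entire subtask to a local specialist model, so
    # screenshots and credentials stay on-device.
    {"name": "gui.delegate_task", "params": {"instruction": "str"}},
]

def route(tool_call: dict) -> str:
    """Route a tool call to the on-device specialist or the atomic executor."""
    name = tool_call["name"]
    if any(t["name"] == name for t in HIGH_LEVEL_TOOLS):
        return "on-device specialist"   # privacy-sensitive path
    if any(t["name"] == name for t in LOW_LEVEL_TOOLS):
        return "atomic executor"
    raise KeyError(f"unknown tool: {name}")
```

The design point this illustrates is that a cloud planner can orchestrate via high-level delegation while only the local model ever sees raw screen content.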