GUI-Shepherd: Reliable Process Reward and Verification for Long-Sequence GUI Tasks

📅 2025-09-28
📈 Citations: 0 · Influential: 0
🤖 AI Summary
To address the challenges of sparse rewards and credit assignment that undermine agent reliability in long-horizon GUI tasks, this paper introduces the first process-supervision framework for GUI agents. It proposes a unified model that jointly performs process reward modeling and reasoning-based verification. Trained on a dataset of 52K interactions with human-annotated quality scores and GPT-4o-generated attribution rationales, the resulting high-fidelity process reward model supports both PPO-based reinforcement learning and dynamic decision verification at inference time. On the AndroidWorld benchmark, integrating the model as a reward function improves task success rate by 7.7 percentage points, while using it as a verifier yields a 5.1-point gain; on AndroidControl, the corresponding improvements are 2.2 and 4.3 points. This work pioneers the systematic study of process supervision for GUI agents, establishing a new paradigm for fine-grained behavioral guidance and trustworthy evaluation.

📝 Abstract
Autonomous agents for long-sequence Graphical User Interface tasks are hindered by sparse rewards and the intractable credit assignment problem. To address these challenges, we introduce GUI-Shepherd, a Process Reward Model that provides dense, step-by-step feedback to guide agents. GUI-Shepherd is trained on a diverse, large-scale dataset of 52k interactions featuring human-annotated scores and GPT-4o-generated rationales, enabling it to serve both as a reward provider for RL training and as a verifier for inference. To our knowledge, we are the first to conduct a systematic study of process supervision in GUI agents, across diverse settings from online long-horizon tasks to offline single-step prediction. On the online AndroidWorld benchmark, GUI-Shepherd improves success rate by 7.7 points via multi-turn online PPO, significantly outperforming Outcome Reward Model based competitors. Used as an inference verifier, it brings a 5.1-point improvement. The benefits generalize to the offline AndroidControl benchmark, with gains of 2.2 points as a reward provider and 4.3 points as a verifier. Collectively, our results establish that high-fidelity process supervision is critical for building more capable GUI agents and present a generalizable solution.
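
To make the "dense, step-by-step feedback" concrete, here is a minimal sketch of a rollout loop in which the process reward model (PRM) scores every (observation, action) pair, so PPO receives a per-step signal rather than one sparse terminal reward. The `ProcessRewardModel` class, the `rollout` helper, and the `env`/`policy` interfaces are illustrative stand-ins, not the paper's actual API.

```python
import random

class ProcessRewardModel:
    """Illustrative stand-in: a real PRM such as GUI-Shepherd would run a
    multimodal model over the screenshot and candidate action and return
    a learned quality score; here a random placeholder is used."""
    def score(self, observation, action) -> float:
        return random.random()  # placeholder for a learned score in [0, 1]

def rollout(env, policy, prm, max_steps=30):
    """Collect one episode in which every step is rewarded by the PRM,
    giving the PPO learner dense credit instead of one terminal signal."""
    obs = env.reset()
    trajectory = []
    for _ in range(max_steps):
        action = policy(obs)
        reward = prm.score(obs, action)      # dense, per-step feedback
        trajectory.append((obs, action, reward))
        obs, done = env.step(action)
        if done:
            break
    return trajectory  # consumed by a standard PPO update
```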
Problem

Research questions and friction points this paper is trying to address.

Addresses sparse rewards in long-sequence GUI tasks
Solves intractable credit assignment for GUI agents
Provides process supervision for reliable GUI automation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Process Reward Model providing dense step-by-step feedback
Trained on human-annotated scores and GPT-4o rationales
Serves as both an RL reward provider and an inference-time verifier (see the sketch after this list)
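
A correspondingly small sketch of the verifier role at inference time, under the same assumed interfaces as above: sample several candidate actions from the policy, score each with the PRM, and execute only the top-rated one (best-of-N selection). The `verified_step`, `propose`, and `prm_score` names are hypothetical.

```python
def verified_step(propose, prm_score, observation, n_candidates=5):
    """Best-of-N verification: draw candidate actions from the policy
    and keep the one the process reward model rates highest."""
    candidates = [propose(observation) for _ in range(n_candidates)]
    return max(candidates, key=lambda a: prm_score(observation, a))
```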
Authors

Cong Chen · Zhejiang University, Ant Group
Kaixiang Ji · Ant Group · Computer Vision, Multimodal
Hao Zhong · Professor, Shanghai Jiao Tong University · Software Engineering
Muzhi Zhu · Zhejiang University · Computer Vision, Machine Learning
Anzhou Li · Zhejiang University, Ant Group
Guo Gan · Zhejiang University
Ziyuan Huang · Ant Group
Cheng Zou · Ant Group
Jiajia Liu · Ant Group · Computer Vision, Multimodal
Jingdong Chen · Ant Group
Hao Chen · Zhejiang University
Chunhua Shen · Zhejiang University · Computer Vision, Machine Learning