GUI-Shepherd: Reliable Process Reward and Verification for Long-Sequence GUI Tasks

📅 2025-09-28
📈 Citations: 0 · Influential: 0
🤖 AI Summary
To address the challenges of sparse rewards and credit assignment that undermine agent reliability in long-horizon GUI tasks, this paper introduces the first process-supervision framework for GUI agents. It proposes a unified model that jointly performs process reward modeling and reasoning-based verification. Trained on a dataset of 52K interactions with human-annotated quality scores and GPT-4o-generated attribution rationales, the resulting high-fidelity process reward model supports both PPO-based reinforcement learning and dynamic decision verification at inference time. On the AndroidWorld benchmark, integrating the model as a reward function improves task success rate by 7.7 percentage points, while using it as a verifier yields a 5.1-point gain; on AndroidControl, the corresponding improvements are 2.2 and 4.3 points. This work pioneers the systematic study of process supervision for GUI agents, establishing a new paradigm for fine-grained behavioral guidance and trustworthy evaluation.

📝 Abstract
Autonomous agents for long-sequence Graphical User Interface tasks are hindered by sparse rewards and the intractable credit assignment problem. To address these challenges, we introduce GUI-Shepherd, a Process Reward Model that provides dense, step-by-step feedback to guide agents. GUI-Shepherd is trained on a diverse, large-scale dataset of 52k interactions featuring human-annotated scores and GPT-4o-generated rationales, enabling it to serve both as a reward provider for RL training and as a verifier for inference. To our knowledge, we are the first to conduct a systematic study of process supervision in GUI agents, across diverse settings from online long-horizon tasks to offline single-step prediction. On the online AndroidWorld benchmark, GUI-Shepherd improves success rate by 7.7 points via multi-turn online PPO, significantly outperforming Outcome Reward Model based competitors. Used as an inference verifier, it brings a 5.1-point improvement. The benefits generalize to the offline AndroidControl benchmark, with gains of 2.2 points as a reward provider and 4.3 points as a verifier. Collectively, our results establish that high-fidelity process supervision is critical for building more capable GUI agents and present a generalizable solution.
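
To make the "dense, step-by-step feedback" concrete, here is a minimal sketch of a rollout loop in which the process reward model (PRM) scores every (observation, action) pair, so PPO receives a per-step signal rather than one sparse terminal reward. The `ProcessRewardModel` class, the `rollout` helper, and the `env`/`policy` interfaces are illustrative stand-ins, not the paper's actual API.

```python
import random

class ProcessRewardModel:
    """Illustrative stand-in: a real PRM such as GUI-Shepherd would run a
    multimodal model over the screenshot and candidate action and return
    a learned quality score; here a random placeholder is used."""
    def score(self, observation, action) -> float:
        return random.random()  # placeholder for a learned score in [0, 1]

def rollout(env, policy, prm, max_steps=30):
    """Collect one episode in which every step is rewarded by the PRM,
    giving the PPO learner dense credit instead of one terminal signal."""
    obs = env.reset()
    trajectory = []
    for _ in range(max_steps):
        action = policy(obs)
        reward = prm.score(obs, action)      # dense, per-step feedback
        trajectory.append((obs, action, reward))
        obs, done = env.step(action)
        if done:
            break
    return trajectory  # consumed by a standard PPO update
```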
Problem

Research questions and friction points this paper is trying to address.

Addresses sparse rewards in long-sequence GUI tasks
Solves intractable credit assignment for GUI agents
Provides process supervision for reliable GUI automation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Process Reward Model providing dense step-by-step feedback
Trained on human-annotated scores and GPT-4o rationales
Serves as both an RL reward provider and an inference-time verifier (see the sketch after this list)
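
A correspondingly small sketch of the verifier role at inference time, under the same assumed interfaces as above: sample several candidate actions from the policy, score each with the PRM, and execute only the top-rated one (best-of-N selection). The `verified_step`, `propose`, and `prm_score` names are hypothetical.

```python
def verified_step(propose, prm_score, observation, n_candidates=5):
    """Best-of-N verification: draw candidate actions from the policy
    and keep the one the process reward model rates highest."""
    candidates = [propose(observation) for _ in range(n_candidates)]
    return max(candidates, key=lambda a: prm_score(observation, a))
```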
Authors

Cong Chen · Zhejiang University, Ant Group
Kaixiang Ji · Ant Group · Computer Vision, Multimodal
Hao Zhong · Professor, Shanghai Jiao Tong University · Software Engineering
Muzhi Zhu · Zhejiang University · Computer Vision, Machine Learning
Anzhou Li · Zhejiang University, Ant Group
Guo Gan · Zhejiang University
Ziyuan Huang · Ant Group
Cheng Zou · Ant Group
Jiajia Liu · Ant Group · Computer Vision, Multimodal
Jingdong Chen · Ant Group
Hao Chen · Zhejiang University
Chunhua Shen · Zhejiang University · Computer Vision, Machine Learning