🤖 AI Summary
To address the challenges of sparse rewards and credit assignment in long-horizon GUI tasks—which undermine agent reliability—this paper introduces the first process-supervision framework for GUI agents. It proposes a unified model that jointly performs process reward modeling and reasoning-based verification. Trained on a dataset comprising 52K human-annotated quality scores and GPT-4o-generated attribution rationales, the high-fidelity process reward model supports both PPO-based reinforcement learning optimization and dynamic decision verification at inference time. On the AndroidWorld benchmark, integrating the model as a reward function improves task success rate by 7.7 percentage points, while using it as a verifier yields a 5.1-point gain; on AndroidControl, the corresponding improvements are 2.2 and 4.3 points, respectively. This work pioneers systematic investigation of process supervision for GUI agents, establishing a new paradigm for fine-grained behavioral guidance and trustworthy evaluation.
📝 Abstract
Autonomous agents for long-sequence Graphical User Interface tasks are hindered by sparse rewards and the intractable credit assignment problem. To address these challenges, we introduce GUI-Shepherd, a Process Reward Model that provides dense, step-by-step feedback to guide agents. GUI-Shepherd is trained on a diverse, large-scale dataset of $52$k interactions featuring human-annotated scores and GPT-4o-generated rationales, enabling it to serve both as a reward provider for RL training and as a verifier at inference. To our knowledge, this is the first systematic study of process supervision in GUI agents, spanning diverse settings from online long-horizon tasks to offline single-step prediction. On the online AndroidWorld benchmark, GUI-Shepherd improves success rate by $7.7$ points via multi-turn online PPO, significantly outperforming Outcome Reward Model-based competitors. When used as an inference-time verifier, it yields a $5.1$-point improvement. The benefits generalize to the offline AndroidControl benchmark, with gains of $2.2$ points as a reward provider and $4.3$ points as a verifier. Collectively, our results establish that high-fidelity process supervision is critical for building more capable GUI agents and present a generalizable solution.
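The verifier role described above can be sketched as best-of-N action selection: the process reward model scores each candidate step against the interaction history, and the agent executes the highest-scoring one. The sketch below is illustrative only; the `Step` type, the `PRM` callable interface, and the toy scoring function are assumptions for this example, not the paper's actual model or API.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    observation: str  # e.g., a serialized screen state (assumed representation)
    action: str       # a candidate GUI action

# Hypothetical PRM interface: maps (history, candidate step) to a scalar
# quality score. The paper's PRM is a trained model; here any callable works
# so the sketch stays self-contained.
PRM = Callable[[List[Step], Step], float]

def verify_best_action(prm: PRM, history: List[Step], candidates: List[Step]) -> Step:
    """Best-of-N verification: score each candidate step with the process
    reward model and return the highest-scoring one for execution."""
    return max(candidates, key=lambda step: prm(history, step))

# Toy stand-in PRM that prefers actions touching the target widget
# (for illustration only; not the paper's trained model).
def toy_prm(history: List[Step], step: Step) -> float:
    return 1.0 if "Settings" in step.action else 0.1

history: List[Step] = []
candidates = [
    Step("home screen", "tap('Camera')"),
    Step("home screen", "tap('Settings')"),
]
best = verify_best_action(toy_prm, history, candidates)
# best.action == "tap('Settings')"
```

The same scoring interface doubles as a dense per-step reward signal for PPO training, which is how the paper uses the model in its reward-provider role.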