Web-Shepherd: Advancing PRMs for Reinforcing Web Agents

πŸ“… 2025-05-21
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing multimodal large language models (MLLMs) used as reward models for web navigation suffer from slow inference, high computational cost, and insufficient domain adaptation. Method: We propose the first step-level process reward model (PRM) tailored to web interaction trajectories. Our approach comprises: (1) constructing WebPRM Collectionβ€”a 40K-step preference dataset annotated at the action-step level; (2) designing WebRewardBench, a dedicated meta-evaluation benchmark for PRMs; and (3) training a lightweight PRM via preference learning, augmented with human-designed checklists, cross-domain trajectory comparison, and a hybrid verification framework integrating GPT-4o-mini and Web-Shepherd. Results: On WebRewardBench, our PRM achieves ~30-point higher accuracy than GPT-4o; on WebArena-lite, it improves task success rate by 10.9 points while reducing inference cost by 10Γ— compared to standard MLLM-based reward models.
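The "training a lightweight PRM via preference learning" step described above is typically optimized with a Bradley-Terry-style objective over chosen/rejected action pairs. A minimal sketch of that loss, assuming scalar step scores from the reward model (the function name here is illustrative, not from the paper):

```python
import math

def preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry-style loss: -log sigmoid(s_chosen - s_rejected).
    Minimized when the PRM scores the preferred step above the rejected one."""
    margin = score_chosen - score_rejected
    # log1p(exp(-x)) is a numerically stable form of -log(sigmoid(x))
    return math.log1p(math.exp(-margin))
```

At a margin of zero the loss is log 2, and it shrinks as the chosen step is scored increasingly higher than the rejected one, pushing the PRM to separate good and bad actions.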

πŸ“ Abstract
Web navigation is a unique domain that can automate many repetitive real-life tasks and is challenging because it requires long-horizon sequential decision making beyond typical multimodal large language model (MLLM) tasks. Yet specialized reward models for web navigation, usable during both training and test time, have been absent until now. Despite the importance of speed and cost-effectiveness, prior works have used MLLMs as reward models, which poses significant constraints for real-world deployment. To address this, we propose Web-Shepherd, the first process reward model (PRM) that can assess web navigation trajectories at the step level. To achieve this, we first construct the WebPRM Collection, a large-scale dataset with 40K step-level preference pairs and annotated checklists spanning diverse domains and difficulty levels. We also introduce WebRewardBench, the first meta-evaluation benchmark for evaluating PRMs. In our experiments, Web-Shepherd achieves about 30 points better accuracy than GPT-4o on WebRewardBench. Furthermore, when testing on WebArena-lite with GPT-4o-mini as the policy and Web-Shepherd as the verifier, we achieve 10.9 points better performance at 10 times less cost than using GPT-4o-mini as the verifier. Our model, dataset, and code are publicly available at LINK.
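The verifier setup the abstract describes (GPT-4o-mini as the policy, Web-Shepherd as the reward model) amounts to best-of-n action selection: the policy proposes several candidate next actions and the PRM's step-level score picks one. A minimal sketch under that reading, with a stand-in scoring table rather than the actual models:

```python
from typing import Callable, List

def select_best_action(candidates: List[str],
                       prm_score: Callable[[str], float]) -> str:
    """Best-of-n selection: return the candidate the PRM scores highest."""
    return max(candidates, key=prm_score)

# Stand-in scores for illustration; a real PRM would score each action
# against the task checklist and the trajectory so far.
toy_scores = {"click(login)": 0.9, "scroll(down)": 0.3, "type(query)": 0.6}
best = select_best_action(list(toy_scores), lambda a: toy_scores[a])
```

Because the verifier only scores a handful of short candidate actions per step, a small dedicated PRM can be far cheaper to run than a general-purpose MLLM judge, which is the cost advantage the paper reports.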
Problem

Research questions and friction points this paper is trying to address.

Lack of specialized reward models for web navigation tasks
High cost and inefficiency of using MLLMs as reward models
Need for step-level assessment in web navigation trajectories
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed step-level process reward model Web-Shepherd
Created large-scale dataset WebPRM Collection
Introduced meta-evaluation benchmark WebRewardBench
πŸ”Ž Similar Papers