🤖 AI Summary
This work addresses the challenge of sparse and delayed supervision in long-horizon, irreversible web interactions, where conventional process reward models struggle to deliver fine-grained, interpretable, and robust feedback. To this end, we propose WebArbiter, a reasoning-centric approach to web process reward modeling that, for the first time, integrates principle-guided reasoning into reward judgment. By generating structured justifications that conclude with a preference verdict, WebArbiter produces explainable, layout-invariant, and generalizable reward signals. Training follows a two-stage paradigm: reasoning distillation first imparts principle-guided reasoning capabilities, and reinforcement learning then aligns verdicts with ground-truth correctness. On our newly curated WebPRMBench benchmark, WebArbiter-7B outperforms GPT-5 by 9.1 points, and in trajectory search on WebArena-Lite it surpasses the previous best WebPRM by up to 7.2 points, demonstrating its effectiveness and practical utility.
📝 Abstract
Web agents hold great potential for automating complex computer tasks, yet their interactions involve long-horizon, sequential decision-making with irreversible actions. In such settings, outcome-based supervision is sparse and delayed, often rewarding incorrect trajectories and failing to support inference-time scaling. This motivates the use of web Process Reward Models (WebPRMs) for web navigation, but existing approaches remain limited: scalar WebPRMs collapse progress into coarse, weakly grounded signals, while checklist-based WebPRMs rely on brittle template matching that fails under layout or semantic changes and often mislabels superficially correct actions as successful, offering little interpretability. To address these challenges, we introduce WebArbiter, a reasoning-first, principle-inducing WebPRM that formulates reward modeling as text generation, producing structured justifications that conclude with a preference verdict and identify the action most conducive to task completion in the current context. Training follows a two-stage pipeline: reasoning distillation equips the model with coherent principle-guided reasoning, and reinforcement learning corrects teacher biases by directly aligning verdicts with correctness, enabling stronger generalization. To support systematic evaluation, we release WebPRMBench, a comprehensive benchmark spanning four diverse web environments with rich tasks and high-quality preference annotations. On WebPRMBench, WebArbiter-7B outperforms the strongest baseline, GPT-5, by 9.1 points. In reward-guided trajectory search on WebArena-Lite, it surpasses the best prior WebPRM by up to 7.2 points, underscoring its robustness and practical value on complex real-world web tasks.
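To make the "reward modeling as text generation" idea concrete, here is a minimal sketch of reward-guided action selection. The judge is a stub standing in for an actual WebArbiter call, and the verdict format (`VERDICT: <score>` as the final line) and all function names are illustrative assumptions, not the paper's actual interface.

```python
import re

# Assumed verdict format: the generative judge ends its justification
# with a line "VERDICT: <score>". This format is a hypothetical stand-in
# for WebArbiter's actual structured output.
VERDICT_RE = re.compile(r"VERDICT:\s*([0-9]*\.?[0-9]+)\s*$")

def parse_verdict(judgment: str) -> float:
    """Extract the numeric preference score from the judge's final line."""
    match = VERDICT_RE.search(judgment.strip())
    if match is None:
        return 0.0  # unparseable judgment -> treat as lowest reward
    return float(match.group(1))

def stub_judge(task: str, action: str) -> str:
    """Stand-in for a generative WebPRM: justification plus verdict line."""
    score = 0.9 if "submit" in action else 0.2  # toy scoring heuristic
    return (
        f"Evaluating action '{action}' against task '{task}'.\n"
        f"VERDICT: {score}"
    )

def best_action(task: str, candidates: list[str]) -> str:
    """Reward-guided selection: score each candidate, keep the best."""
    return max(candidates, key=lambda a: parse_verdict(stub_judge(task, a)))

print(best_action("file a bug report", ["click('home')", "click('submit')"]))
# -> click('submit')
```

In an actual trajectory-search setting, `stub_judge` would be replaced by a call to the trained reward model, and the loop would run at each step of the rollout rather than once.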