🤖 AI Summary
Existing reward models struggle to reliably distinguish fine-grained quality differences in knowledge-intensive and long-context tasks, especially when external evidence is unavailable. To address this, we propose OpenRM, a tool-augmented reward model that dynamically retrieves external evidence via tool calls to enable evidence-driven evaluation of open-ended generations. Methodologically, we train OpenRM with Group Relative Policy Optimization (GRPO), a joint optimization setup that simultaneously refines its tool-invocation policy and its judgment accuracy; we further design a controllable data synthesis pipeline that generates over 27,000 high-quality preference pairs. Evaluated on three newly constructed knowledge-intensive benchmarks and two established ones, OpenRM consistently outperforms state-of-the-art reward models. Empirical results demonstrate that its evidence-grounded rewards significantly improve downstream large language models' alignment performance, both at inference time (e.g., best-of-N sampling) and during training (e.g., RLHF).
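The group-relative normalization at the heart of GRPO can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: each sampled judgment trajectory receives a scalar reward (which, per the summary, would combine tool-usage and judgment-accuracy signals), and advantages are computed relative to the group's own statistics rather than a learned value baseline.

```python
# Minimal sketch of GRPO's group-relative advantage (illustrative only).
from statistics import mean, stdev

def group_relative_advantages(rewards):
    """Normalize each rollout's reward against its own group's mean and std."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 1.0
    sigma = sigma or 1.0  # guard against a zero-variance group
    return [(r - mu) / sigma for r in rewards]

# Example: scalar rewards for a group of 4 sampled judgment trajectories,
# e.g. blending tool-call validity and final-judgment correctness
# (the weighting is hypothetical, not from the paper).
advantages = group_relative_advantages([1.0, 0.5, 0.0, 0.5])
```

Because advantages are centered within each group, above-average trajectories (here the first) are reinforced and below-average ones discouraged, without training a separate critic.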
📝 Abstract
Reward models (RMs) have become essential for aligning large language models (LLMs), serving as scalable proxies for human evaluation in both training and inference. However, existing RMs struggle on knowledge-intensive and long-form tasks, where evaluating correctness requires grounding beyond the model's internal knowledge. This limitation hinders them from reliably discriminating subtle quality differences, especially when external evidence is necessary. To address this, we introduce OpenRM, a tool-augmented long-form reward model that systematically judges open-ended responses by invoking external tools to gather relevant evidence. We train OpenRM with Group Relative Policy Optimization (GRPO) on over 27K synthesized pairwise examples generated through a controllable data synthesis framework. The training objective jointly supervises intermediate tool usage and final outcome accuracy, incentivizing our reward model to learn effective evidence-based judgment strategies. Extensive experiments on three newly collected datasets and two widely used benchmarks demonstrate that OpenRM substantially outperforms existing reward modeling approaches. As a further step, we integrate OpenRM into both inference-time response selection and training-time data selection. This yields consistent gains in downstream LLM alignment tasks, highlighting the potential of tool-augmented reward models for scaling reliable long-form evaluation.
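The inference-time response selection mentioned above (best-of-N sampling) reduces to scoring each candidate with the reward model and keeping the top one. A minimal sketch, where `score` is a placeholder for the reward model's scoring call (hypothetical, not OpenRM's actual interface):

```python
# Hedged sketch of best-of-N response selection with a reward model.
def best_of_n(candidates, score):
    """Return the candidate response that the scoring function rates highest."""
    return max(candidates, key=score)

# Illustrative usage with a toy scorer (length stands in for reward; demo only).
responses = ["short answer", "a somewhat longer, evidence-backed answer"]
best = best_of_n(responses, score=len)
```

In practice, `score` would wrap a pairwise or pointwise judgment from the reward model; the same selection logic also applies to training-time data selection, where the top-scored responses are kept for fine-tuning.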