🤖 AI Summary
Existing reward models struggle to reliably distinguish fine-grained quality differences in knowledge-intensive and long-context tasks, especially when external evidence is unavailable. To address this, we propose OpenRM, a tool-augmented reward model that dynamically retrieves external evidence via tool calls to enable evidence-driven evaluation of open-ended generations. Methodologically, we train OpenRM with Group Relative Policy Optimization (GRPO), a joint optimization setup that simultaneously refines its tool-invocation policy and its judgment accuracy; we further design a controllable data synthesis pipeline that generates over 27,000 high-quality preference pairs. Evaluated on three newly constructed knowledge-intensive benchmarks and two established ones, OpenRM consistently outperforms state-of-the-art reward models. Empirical results demonstrate that its evidence-grounded rewards significantly improve downstream large language models' alignment performance, both at inference time (e.g., best-of-N sampling) and during training (e.g., RLHF).
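The group-relative normalization at the heart of GRPO can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: each sampled judgment trajectory receives a scalar reward (which, per the summary, would combine tool-usage and judgment-accuracy signals), and advantages are computed relative to the group's own statistics rather than a learned value baseline.

```python
# Minimal sketch of GRPO's group-relative advantage (illustrative only).
from statistics import mean, stdev

def group_relative_advantages(rewards):
    """Normalize each rollout's reward against its own group's mean and std."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 1.0
    sigma = sigma or 1.0  # guard against a zero-variance group
    return [(r - mu) / sigma for r in rewards]

# Example: scalar rewards for a group of 4 sampled judgment trajectories,
# e.g. blending tool-call validity and final-judgment correctness
# (the weighting is hypothetical, not from the paper).
advantages = group_relative_advantages([1.0, 0.5, 0.0, 0.5])
```

Because advantages are centered within each group, above-average trajectories (here the first) are reinforced and below-average ones discouraged, without training a separate critic.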
📝 Abstract
Reward models (RMs) have become essential for aligning large language models (LLMs), serving as scalable proxies for human evaluation in both training and inference. However, existing RMs struggle on knowledge-intensive and long-form tasks, where evaluating correctness requires grounding beyond the model's internal knowledge. This limitation hinders them from reliably discriminating subtle quality differences, especially when external evidence is necessary. To address this, we introduce OpenRM, a tool-augmented long-form reward model that systematically judges open-ended responses by invoking external tools to gather relevant evidence. We train OpenRM with Group Relative Policy Optimization (GRPO) on over 27K synthesized pairwise examples generated through a controllable data synthesis framework. The training objective jointly supervises intermediate tool usage and final outcome accuracy, incentivizing our reward model to learn effective evidence-based judgment strategies. Extensive experiments on three newly collected datasets and two widely used benchmarks demonstrate that OpenRM substantially outperforms existing reward modeling approaches. As a further step, we integrate OpenRM into both inference-time response selection and training-time data selection. This yields consistent gains in downstream LLM alignment tasks, highlighting the potential of tool-augmented reward models for scaling reliable long-form evaluation.
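The inference-time response selection mentioned above (best-of-N sampling) reduces to scoring each candidate with the reward model and keeping the top one. A minimal sketch, where `score` is a placeholder for the reward model's scoring call (hypothetical, not OpenRM's actual interface):

```python
# Hedged sketch of best-of-N response selection with a reward model.
def best_of_n(candidates, score):
    """Return the candidate response that the scoring function rates highest."""
    return max(candidates, key=score)

# Illustrative usage with a toy scorer (length stands in for reward; demo only).
responses = ["short answer", "a somewhat longer, evidence-backed answer"]
best = best_of_n(responses, score=len)
```

In practice, `score` would wrap a pairwise or pointwise judgment from the reward model; the same selection logic also applies to training-time data selection, where the top-scored responses are kept for fine-tuning.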