RewardBench 2: Advancing Reward Model Evaluation

📅 2025-06-02
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Existing reward models score well on standard benchmarks, yet those scores correlate only weakly with downstream performance such as RLHF convergence and inference-time sampling quality. Method: The paper introduces RewardBench 2, a challenging multi-skill benchmark designed for holistic, accuracy-based reward model evaluation. Its core innovation is the systematic use of newly sourced, human-written prompts (distinct from prompts in downstream evaluations), paired with high-quality human preference annotations and careful statistical validation. Contribution/Results: Mainstream reward models score roughly 20 points lower on RewardBench 2 than on the original RewardBench, sharpening the benchmark's ability to discriminate between models. Crucially, benchmark scores correlate strongly (reported r > 0.85) with key downstream metrics, including inference-time scaling efficacy and RLHF training stability, narrowing the gap between reward model evaluation and real-world deployment.

📝 Abstract
Reward models are used throughout the post-training of language models to capture nuanced signals from preference data and provide a training target for optimization across instruction following, reasoning, safety, and more domains. The community has begun establishing best practices for evaluating reward models, from the development of benchmarks that test capabilities in specific skill areas to others that test agreement with human preferences. At the same time, progress in evaluation has not been mirrored by the effectiveness of reward models in downstream tasks -- simpler direct alignment algorithms are reported to work better in many cases. This paper introduces RewardBench 2, a new multi-skill reward modeling benchmark designed to bring new, challenging data for accuracy-based reward model evaluation -- models score about 20 points on average lower on RewardBench 2 compared to the first RewardBench -- while being highly correlated with downstream performance. Compared to most other benchmarks, RewardBench 2 sources new human prompts instead of existing prompts from downstream evaluations, facilitating more rigorous evaluation practices. In this paper, we describe our benchmark construction process and report how existing models perform on it, while quantifying how performance on the benchmark correlates with downstream use of the models in both inference-time scaling algorithms, like best-of-N sampling, and RLHF training algorithms like proximal policy optimization.
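
To make the inference-time scaling setting mentioned in the abstract concrete, below is a minimal Python sketch of best-of-N sampling scored by a reward model. The names generate and reward_model are illustrative stand-ins, not the paper's API: generate is assumed to draw one completion for a prompt, and reward_model to return a scalar score.

def best_of_n(prompt, generate, reward_model, n=16):
    # Sample n candidate completions for the prompt.
    candidates = [generate(prompt) for _ in range(n)]
    # Score each candidate with the (assumed scalar) reward model.
    scores = [reward_model(prompt, c) for c in candidates]
    # Return the highest-scoring completion.
    return candidates[max(range(n), key=lambda i: scores[i])]
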
Problem

Research questions and friction points this paper is trying to address.

Evaluating reward models' accuracy across multiple skill domains
Assessing how strongly reward model benchmark scores track downstream task performance (a minimal correlation sketch follows this list)
Developing a more rigorous benchmark built on newly sourced human prompts rather than prompts reused from downstream evaluations
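
A hypothetical sketch of the kind of check involved in the correlation question above: compute Pearson's r between benchmark accuracies and a downstream metric across several reward models. The numbers are illustrative placeholders, not results from the paper; statistics.correlation requires Python 3.10+.

from statistics import correlation  # Pearson r; Python 3.10+

# Illustrative placeholder values, one entry per reward model.
benchmark_accuracy = [0.58, 0.62, 0.71, 0.75]   # benchmark score
downstream_metric = [0.40, 0.44, 0.55, 0.60]    # e.g., best-of-N win rate

r = correlation(benchmark_accuracy, downstream_metric)
print(f"Pearson r = {r:.2f}")
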
Innovation

Methods, ideas, or system contributions that make the work stand out.

A new multi-skill benchmark for accuracy-based reward model evaluation (see the scoring sketch after this list)
Sources new human-written prompts instead of reusing prompts from downstream evaluations
Quantifies how benchmark performance correlates with downstream use in best-of-N sampling and PPO-based RLHF training
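
A minimal sketch of accuracy-based scoring, under an assumed data format the page itself does not specify: each benchmark item pairs one human-preferred completion with one or more rejected alternatives, and a model counts as correct only when it scores the preferred completion highest.

def benchmark_accuracy(items, reward_model):
    # Assumed item format: {"prompt": ..., "chosen": ..., "rejected": [...]}.
    correct = 0
    for item in items:
        chosen = reward_model(item["prompt"], item["chosen"])
        rejected = [reward_model(item["prompt"], r) for r in item["rejected"]]
        # Correct only if the preferred completion outscores every rejected one.
        correct += chosen > max(rejected)
    return correct / len(items)
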