🤖 AI Summary
Problem: Existing reward models perform well on standard benchmarks, but their scores correlate only weakly with downstream task performance, such as RLHF convergence and inference-time sampling quality.
Method: We introduce RewardBench 2, a new high-difficulty, multi-skill reward modeling benchmark designed for holistic evaluation. Its core innovation is the systematic use of original, human-authored prompts (distinct from the prompts used in downstream evaluations), paired with high-quality human preference annotations and rigorous statistical validation.
Contribution/Results: Mainstream reward models score roughly 20 points lower on RewardBench 2 than on its predecessor, markedly improving the benchmark's discriminative power. Crucially, model scores correlate strongly (r > 0.85) with key downstream metrics, including inference-time scaling efficacy and RLHF training stability, bridging the gap between reward model evaluation and real-world deployment.
📝 Abstract
Reward models are used throughout the post-training of language models to capture nuanced signals from preference data and provide a training target for optimization across instruction following, reasoning, safety, and other domains. The community has begun establishing best practices for evaluating reward models, from the development of benchmarks that test capabilities in specific skill areas to others that test agreement with human preferences. At the same time, progress in evaluation has not been mirrored by the effectiveness of reward models in downstream tasks -- simpler direct alignment algorithms are reported to work better in many cases. This paper introduces RewardBench 2, a new multi-skill reward modeling benchmark designed to bring new, challenging data for accuracy-based reward model evaluation -- models score about 20 points lower on average on RewardBench 2 compared to the first RewardBench -- while being highly correlated with downstream performance. Compared to most other benchmarks, RewardBench 2 sources new human prompts instead of existing prompts from downstream evaluations, facilitating more rigorous evaluation practices. In this paper, we describe our benchmark construction process and report how existing models perform on it, while quantifying how performance on the benchmark correlates with downstream use of the models in both inference-time scaling algorithms, like best-of-N sampling, and RLHF training algorithms like proximal policy optimization.
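The best-of-N sampling mentioned above is the simplest way a reward model is used at inference time: sample N candidate completions and keep the one the reward model scores highest. A minimal sketch follows; `toy_generate` and `toy_reward` are hypothetical stand-ins for a generator and a reward model, not components from the paper.

```python
import itertools


def best_of_n(prompt, generate, reward_model, n=4):
    """Sample n candidate completions and return the one scored highest
    by the reward model. `generate` and `reward_model` are callables
    supplied by the caller."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: reward_model(prompt, c))


# Toy stand-ins (illustrative only): a "generator" that cycles through
# canned completions, and a "reward model" that prefers longer answers.
_canned = itertools.cycle(["ok", "a longer answer", "the longest answer of all"])


def toy_generate(prompt):
    return next(_canned)


def toy_reward(prompt, completion):
    return len(completion)


print(best_of_n("What is RLHF?", toy_generate, toy_reward, n=3))
# → the longest answer of all
```

The quality of the selected completion hinges entirely on how well the reward model ranks candidates, which is why the paper measures how benchmark accuracy correlates with best-of-N performance downstream.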