Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization

📅 2026-04-08
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Current reward models lack effective mechanisms for evaluating individual user preferences, which limits their ability to support personalized alignment across diverse settings. To address this gap, this work proposes Personalized RewardBench, the first benchmark specifically designed to assess reward models' capacity to capture user-specific preferences, built from human-annotated pairwise response data that reflects individualized judgments. Reward models are scored on their accuracy at distinguishing the chosen from the rejected response in each pair, and the benchmark itself is validated by correlating its scores with downstream performance under Best-of-N sampling and PPO optimization. Experiments show that even state-of-the-art reward models reach only 75.94% accuracy, while Personalized RewardBench correlates with downstream task performance significantly more strongly than existing benchmarks, confirming its practical utility for personalized alignment.
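As a concrete illustration of the pairwise protocol behind that accuracy number, the sketch below scores each chosen/rejected pair with a reward model and counts how often the chosen response wins. This is a minimal sketch, not the paper's code; the `reward_model.score(prompt, response)` interface and the triple layout of `pairs` are assumptions.

```python
def pairwise_accuracy(reward_model, pairs):
    """Fraction of pairs where the chosen response outscores the rejected one.

    `pairs` is a list of (prompt, chosen, rejected) triples, e.g. one triple
    per user-specific rubric judgment in the benchmark. `reward_model.score`
    is a hypothetical scalar-scoring interface, not the paper's API.
    """
    correct = 0
    for prompt, chosen, rejected in pairs:
        # The chosen response should receive the strictly higher reward.
        if reward_model.score(prompt, chosen) > reward_model.score(prompt, rejected):
            correct += 1
    return correct / len(pairs) if pairs else 0.0
```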
📝 Abstract
Pluralistic alignment has emerged as a critical frontier in the development of Large Language Models (LLMs), with reward models (RMs) serving as a central mechanism for capturing diverse human values. While benchmarks for general response quality are prevalent, evaluating how well reward models account for individual user preferences remains an open challenge. To bridge this gap, we introduce Personalized RewardBench, a novel benchmark designed to rigorously assess reward models' capacity to model personalized preferences. We construct chosen and rejected response pairs based on strict adherence to (or violation of) user-specific rubrics, ensuring that preference distinctions are uniquely tailored to the individual. In particular, human evaluations confirm that the primary discriminative factor between pairs is strictly personal preference, with both responses maintaining high general quality (e.g., correctness, relevance and helpfulness). Extensive testing reveals that existing state-of-the-art reward models struggle significantly with personalization, peaking at an accuracy of just 75.94%. Crucially, because an effective reward model benchmark should predict a reward model's performance on downstream tasks, we conduct experiments demonstrating that our benchmark exhibits a significantly higher correlation with downstream performance in both Best-of-N (BoN) sampling and Proximal Policy Optimization (PPO) compared to existing baselines. These findings establish Personalized RewardBench as a robust and accurate proxy for evaluating reward models' performance in downstream applications.
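Best-of-N selection, one of the two downstream settings the abstract uses for validation, is simple to state in code. The sketch below is a generic BoN loop under assumed `policy_sample` and `reward_score` callables, not the paper's implementation.

```python
def best_of_n(prompt, policy_sample, reward_score, n=16):
    """Standard Best-of-N (BoN) selection: draw n candidate responses from
    the policy and keep the one the reward model scores highest.

    `policy_sample(prompt) -> str` and `reward_score(prompt, resp) -> float`
    are hypothetical callables standing in for a concrete LLM and reward model.
    """
    candidates = [policy_sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda resp: reward_score(prompt, resp))
```

A benchmark with high predictive validity should rank reward models in roughly the same order as their win rates under this kind of selection.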
Problem

Research questions and friction points this paper addresses:

personalized reward models · pluralistic alignment · preference evaluation · reward model benchmarking · individual user preferences
Innovation

Methods, ideas, or system contributions that make the work stand out (the downstream-correlation check is sketched below):

Personalized RewardBench · reward models · pluralistic alignment · preference modeling · downstream correlation
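The downstream-correlation contribution listed above amounts to correlating per-model benchmark scores with per-model downstream metrics. A minimal sketch, assuming one benchmark accuracy and one downstream score (e.g., a BoN win rate) per evaluated reward model:

```python
from scipy.stats import pearsonr, spearmanr

def benchmark_downstream_correlation(bench_acc, downstream):
    """Correlate per-model benchmark accuracies with per-model downstream
    metrics (e.g., BoN win rates or PPO-trained policy scores).
    A higher correlation means the benchmark is a better proxy for
    downstream reward-model performance."""
    r, _ = pearsonr(bench_acc, downstream)     # linear association
    rho, _ = spearmanr(bench_acc, downstream)  # rank agreement
    return r, rho
```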