🤖 AI Summary
This work identifies a critical misalignment in personalized alignment: reward model (RM) ranking accuracy fails to predict actual user-aligned generation behavior under reward-guided decoding (RGD) at inference time. To address this, we propose "policy accuracy" as a new evaluation metric and introduce Pref-LaMP, the first personalized benchmark featuring real user completions. Through systematic empirical analysis, we demonstrate that RM accuracy is nearly uncorrelated with generation quality, rendering existing proxy metrics ineffective. Notably, in-context learning (ICL) methods outperform the best reward-based approaches by 3–5 ROUGE-1 points on 7B models. Our findings challenge the prevailing optimization paradigm, providing both an evaluation methodology and a practical benchmark for deploying personalized alignment systems.
📝 Abstract
Personalized alignment from preference data has focused primarily on improving reward model (RM) accuracy, under the implicit assumption that better preference ranking translates to better personalized behavior. However, in deployment, computational constraints necessitate inference-time adaptation via reward-guided decoding (RGD) rather than per-user policy fine-tuning. This creates a critical but overlooked requirement: reward models must not only rank preferences accurately but also effectively guide token-level generation decisions. We demonstrate that standard RM accuracy fails catastrophically as a selection criterion for deployment-ready personalized alignment. We introduce policy accuracy, a metric quantifying whether RGD scoring functions correctly discriminate between preferred and dispreferred responses, and evaluate it systematically across three datasets. We show that RM accuracy correlates only weakly with this policy-level discrimination ability (Kendall's tau = 0.08–0.31). More critically, we introduce Pref-LaMP, the first personalized alignment benchmark with ground-truth user completions, enabling direct behavioral evaluation without circular reward-based metrics. On Pref-LaMP, we expose a complete decoupling between discrimination and generation: methods with 20-point RM accuracy differences produce almost identical output quality, and even methods achieving high discrimination fail to generate behaviorally aligned responses. Finally, simple in-context learning (ICL) dominates all reward-guided methods for models larger than 3B parameters, achieving 3–5 point ROUGE-1 gains over the best reward method at 7B scale. These findings show that the field optimizes proxy metrics that neither predict deployment performance nor translate preferences into real behavioral adaptation under deployment constraints.
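To make the two metrics concrete, here is a minimal sketch, assuming policy accuracy is the fraction of preference pairs on which the RGD scoring function ranks the preferred response above the dispreferred one (the paper's exact formulation may differ). The scoring function, pair format, and toy numbers below are illustrative assumptions, not the authors' implementation; Kendall's tau is hand-rolled to keep the example self-contained.

```python
def policy_accuracy(score_fn, pairs):
    """Fraction of (preferred, dispreferred) pairs where the RGD
    scoring function assigns the preferred response a higher score.
    `score_fn` is a stand-in for any reward-guided decoding scorer."""
    correct = sum(score_fn(pref) > score_fn(disp) for pref, disp in pairs)
    return correct / len(pairs)

def kendall_tau(xs, ys):
    """Plain O(n^2) Kendall's tau-a between two paired score lists:
    (concordant - discordant) / total pairs."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Toy per-method numbers (made up for illustration): RM ranking
# accuracy moves around while policy accuracy barely tracks it.
rm_acc     = [0.55, 0.75, 0.62, 0.80, 0.58]
policy_acc = [0.60, 0.58, 0.66, 0.61, 0.64]
print(round(kendall_tau(rm_acc, policy_acc), 2))  # → 0.0, no rank correlation
```

In this toy data the two metrics are uncorrelated by construction, mirroring the paper's finding that ranking accuracy is a poor proxy for policy-level discrimination.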