Bayesian Preference Learning for Test-Time Steerable Reward Models

📅 2026-02-09

📈 Citations: 0

✨ Influential: 0

career value

193K/year

🤖 AI Summary

This work addresses the limited adaptability of traditional reward models to shifting preference distributions during testing, where dynamic adjustment capabilities are typically absent. To overcome this, the authors propose In-Context Reward Modeling (ICRM), which formulates reward learning as amortized variational inference over latent preference probabilities within the Bradley–Terry framework. By leveraging a Beta conjugate prior, ICRM enables Bayesian dynamic adaptation at test time through contextual preference examples. This approach is the first to support both single- and multi-objective preference alignment conditioned on context during inference, while offering theoretical guarantees for global optimality. Empirical results demonstrate that ICRM improves accuracy by 34% on SafeRLHF and 9% on RM-Bench, achieves a 4% gain in Pareto front hypervolume for multi-objective settings, and significantly outperforms baselines on mathematical reasoning tasks.

Technology Category

Application Category

📝 Abstract

Reward models are central to aligning language models with human preferences via reinforcement learning (RL). As RL is increasingly applied to settings such as verifiable rewards and multi-objective alignment, RMs are expected to encode more complex and multifaceted preference distributions. However, classifier RMs remain static once trained, limiting their adaptability at test time. We propose Variational In-Context Reward Modeling (ICRM), a novel Bayesian reward modeling objective that enables test-time steerability via in-context preference demonstrations. ICRM casts reward modeling as amortized variational inference over a latent preference probability under the Bradley-Terry model using a conjugate Beta prior. We show that ICRM adapt to unseen preference distributions at test time for both single and multi-objective settings. With more in-context demonstrations, ICRM gains 34% accuracy on SafeRLHF and 9% accuracy on RM-Bench in the single-objective setting, while widening the Pareto frontier with a 4% gain in hypervolume on helpfulness and refusal benchmarks. We further study the practical applicability of ICRM for RL training, showing that it can effectively encode verifiable rewards by outperforming a conventional RM in math reasoning. Finally, we provide theoretical guarantees that the variational objective admits a global interior optimum with finite confidence, and we analyze how KL regularization mitigates reward over-optimization.

Problem

Research questions and friction points this paper is trying to address.

reward models

test-time steerability

preference learning

Bayesian inference

multi-objective alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Bayesian preference learning

test-time steerability

in-context learning