Variance-aware Reward Modeling with Anchor Guidance

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Standard Bradley–Terry (BT) reward models struggle to represent pluralistic preferences on which annotators disagree, while Gaussian reward models, which predict both a reward mean and a reward variance, are not identifiable from pairwise comparisons alone. This work proposes an anchor-guided, variance-aware reward modeling framework that augments preference data with two coarse response-level anchor labels and jointly learns the mean and variance of rewards under a tailored training objective. Theoretically, the authors prove that two anchors suffice to resolve the non-identifiability and establish non-asymptotic convergence rates for the estimated mean and variance functions. Empirically, the method consistently improves reward modeling on simulated data and four real-world diverging-preference datasets, as well as downstream RLHF, including PPO training and best-of-N selection.
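The non-identifiability mentioned above can be illustrated with a minimal sketch. It assumes a probit-style Gaussian reward model in which each response has an independent reward r ~ N(mu, sigma^2), so that P(a preferred to b) = Phi((mu_a - mu_b) / sqrt(sigma_a^2 + sigma_b^2)). This parameterization, the function names, and the numbers are illustrative assumptions, not taken from the paper. Under this assumption, rescaling every mean and standard deviation by the same positive constant and shifting all means leaves every pairwise probability unchanged, so pairwise data alone cannot pin down the reward scale or location.

```python
import math


def gaussian_pref_prob(mu_a, sigma_a, mu_b, sigma_b):
    """P(r_a > r_b) for independent r_a ~ N(mu_a, sigma_a^2), r_b ~ N(mu_b, sigma_b^2).

    Assumed probit-style model, used here only for illustration.
    """
    z = (mu_a - mu_b) / math.sqrt(sigma_a ** 2 + sigma_b ** 2)
    # Standard normal CDF expressed through the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def rescale(mu, sigma, scale, shift):
    """Apply mu -> scale * mu + shift and sigma -> scale * sigma (scale > 0)."""
    return scale * mu + shift, scale * sigma


# Hypothetical (mean, std) pairs for two responses.
resp_a = (1.0, 0.5)
resp_b = (0.2, 1.5)

p_original = gaussian_pref_prob(*resp_a, *resp_b)

# A different parameterization that pairwise data cannot tell apart.
resp_a2 = rescale(*resp_a, scale=3.0, shift=-7.0)
resp_b2 = rescale(*resp_b, scale=3.0, shift=-7.0)
p_transformed = gaussian_pref_prob(*resp_a2, *resp_b2)

print(p_original, p_transformed)
```

The two printed probabilities agree up to floating-point error even though the means and variances differ. The sketch shows only this one family of indistinguishable parameterizations; it does not implement the anchor labels or the paper's training objective.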

📝 Abstract

Standard Bradley–Terry (BT) reward models are limited when human preferences are pluralistic. Although soft preference labels preserve disagreement information, BT can only express it by shrinking reward margins. Gaussian reward models provide an alternative by jointly predicting a reward mean and a reward variance, but they are fundamentally non-identifiable from pairwise preferences alone. We propose Anchor-guided Variance-aware Reward Modeling, a framework that resolves this non-identifiability by augmenting preference data with two coarse response-level anchor labels. Building on this, we prove that two anchors are sufficient for identification, develop a joint training objective, and establish a non-asymptotic convergence rate for both the estimated reward mean and variance functions. Across simulation studies and four real-world diverging-preference datasets, our method consistently improves reward modeling performance and downstream RLHF, including PPO training and best-of-N selection.
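The statement that BT can express disagreement only by shrinking reward margins can be checked with a small sketch. It assumes the usual BT form P(a preferred to b) = sigmoid(r_a - r_b) trained with cross-entropy against a soft label p (the fraction of annotators preferring a). Under that assumption, the loss for a single pair is minimized at margin = log(p / (1 - p)), so labels closer to 0.5 push the margin toward zero. The function names and the label values are hypothetical.

```python
import math


def bt_soft_loss(margin, p):
    """Cross-entropy between soft label p and the BT probability sigmoid(margin)."""
    q = 1.0 / (1.0 + math.exp(-margin))
    return -(p * math.log(q) + (1.0 - p) * math.log(1.0 - q))


def optimal_margin(p):
    """Margin r_a - r_b minimizing bt_soft_loss for a single pair: the logit of p."""
    return math.log(p / (1.0 - p))


# More annotator disagreement (p closer to 0.5) gives a smaller optimal margin.
for p in (0.95, 0.75, 0.55):
    m = optimal_margin(p)
    print(f"soft label {p:.2f}: optimal margin {m:.3f}, loss {bt_soft_loss(m, p):.3f}")
```

In this single-pair view, a small margin is ambiguous: it can mean that two responses are similar in quality or that annotators disagree strongly. A separate variance output is what the Gaussian alternative adds.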

Problem

Research questions and friction points this paper is trying to address.

reward modeling

preference disagreement

non-identifiability

variance-aware

Bradley-Terry

Innovation

Methods, ideas, or system contributions that make the work stand out.

variance-aware reward modeling

anchor guidance

non-identifiability