AI Summary
Standard Bradley–Terry models struggle to accommodate multi-way preferences, while Gaussian reward models suffer from identifiability issues due to reliance solely on pairwise comparisons. This work proposes an anchor-guided, variance-aware reward modeling framework that introduces two coarse-grained response-level anchor labels to jointly learn both the mean and variance of rewards, accompanied by a tailored training objective. Theoretically, we prove that just two anchors suffice to resolve model non-identifiability and establish, for the first time, non-asymptotic convergence rates for both mean and variance functions. Empirical results demonstrate that the proposed method significantly improves reward modeling performance on synthetic data and four real-world multi-way preference datasets, thereby enhancing downstream RLHF tasks such as PPO training and best-of-N sampling.
Abstract
Standard Bradley--Terry (BT) reward models are limited when human preferences are pluralistic. Although soft preference labels preserve disagreement information, BT can only express it by shrinking reward margins. Gaussian reward models provide an alternative by jointly predicting a reward mean and a reward variance, but they are fundamentally non-identifiable from pairwise preferences alone. We propose Anchor-guided Variance-aware Reward Modeling, a framework that resolves this non-identifiability by augmenting preference data with two coarse response-level anchor labels. Building on this, we prove that two anchors are sufficient for identification, develop a joint training objective, and establish a non-asymptotic convergence rate for both the estimated reward mean and variance functions. Across simulation studies and four real-world diverging-preference datasets, our method consistently improves reward modeling performance and downstream RLHF, including PPO training and best-of-$N$ selection.
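The non-identifiability mentioned above can be illustrated with a small numerical sketch. It assumes one common Gaussian preference form that the abstract does not itself specify: rewards are independent, $r(y) \sim \mathcal{N}(\mu(y), \sigma^2(y))$, so that $P(a \succ b) = \Phi\big((\mu_a - \mu_b)/\sqrt{\sigma_a^2 + \sigma_b^2}\big)$. The function names and numeric values below are hypothetical and chosen only for illustration; the paper's actual parameterization and anchor mechanism may differ.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def pref_prob(mu_a: float, var_a: float, mu_b: float, var_b: float) -> float:
    """P(a preferred over b) under the ASSUMED independent-Gaussian reward model."""
    return normal_cdf((mu_a - mu_b) / math.sqrt(var_a + var_b))


# Hypothetical reward means and variances for two responses.
mu = {"a": 1.0, "b": 0.2}
var = {"a": 0.5, "b": 1.5}
p_original = pref_prob(mu["a"], var["a"], mu["b"], var["b"])

# Shift every mean by a constant and rescale means by c, variances by c**2.
shift, scale = 3.0, 2.0
mu_t = {k: scale * v + shift for k, v in mu.items()}
var_t = {k: scale**2 * v for k, v in var.items()}
p_transformed = pref_prob(mu_t["a"], var_t["a"], mu_t["b"], var_t["b"])

# Both parameter sets predict the same pairwise preference probability,
# so pairwise data alone cannot tell them apart.
print(p_original, p_transformed)
```

Under this assumed form, the shift and the scale are two free degrees of freedom that pairwise comparisons leave undetermined, which is at least consistent with the claim that two anchors suffice for identification.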