🤖 AI Summary
Traditional Bradley–Terry (BT) reward models struggle to capture intransitive and cyclic preferences, which limits how well large language models can align with fine-grained human values. To address this, we propose a general preference modeling framework: first, a preference embedding mechanism jointly represents candidate responses and preference structures in a latent space; second, a General Preference Optimization (GPO) algorithm overcomes the theoretical limitations of BT models in intransitive settings by making cyclic preferences learnable. The framework generalizes score-based RLHF. Experiments show that GPO consistently outperforms BT baselines on RewardBench, improves language-model win rates on the downstream AlpacaEval 2.0 task, and accurately models synthetic cyclic-preference data on which any BT model fails completely.
📝 Abstract
Modeling human preferences is crucial for aligning foundation models with human values. Traditional reward modeling methods, such as the Bradley-Terry (BT) reward model, fall short in expressiveness, particularly in addressing intransitive preferences. In this paper, we introduce preference embedding, an approach that embeds responses into a latent space to capture intricate preference structures efficiently, achieving linear query complexity. Additionally, we propose preference score-based General Preference Optimization (GPO), which generalizes reward-based reinforcement learning from human feedback (RLHF). Experimental results show that our General Preference embedding Model (GPM) consistently outperforms the BT reward model on the RewardBench benchmark and effectively models cyclic preferences, on which any BT reward model performs no better than random guessing. Furthermore, evaluations on downstream tasks such as AlpacaEval 2.0, after post-training language models with GPO and our general preference model, reveal performance improvements over BT models. These findings indicate that our method may enhance the alignment of foundation models with nuanced human values. The code is available at https://github.com/general-preference/general-preference-model.
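To see why a BT reward model cannot represent cyclic preferences while an embedding-based score can, consider a minimal sketch (this is an illustration of the idea, not the paper's implementation; the embeddings, the 2-D skew-symmetric operator, and the `score` function here are chosen for the example). A BT model assigns each response a scalar reward `r(y)` and scores a pair by `r(a) - r(b)`, so a cycle like rock ≺ paper ≺ scissors ≺ rock would require `r(rock) < r(paper) < r(scissors) < r(rock)`, a contradiction. A preference score of the form `v(a)ᵀ R v(b)` with skew-symmetric `R` has no such constraint:

```python
import numpy as np

# Skew-symmetric 2x2 operator: score(a, b) = v(a)^T R v(b),
# which guarantees score(a, b) = -score(b, a) (antisymmetry).
R_skew = np.array([[0.0, -1.0],
                   [1.0,  0.0]])

# Hypothetical embeddings: place the three options at 120-degree
# spacing on the unit circle (chosen by hand for this toy example).
angles = {"rock": 0.0, "paper": 2 * np.pi / 3, "scissors": 4 * np.pi / 3}
v = {k: np.array([np.cos(a), np.sin(a)]) for k, a in angles.items()}

def score(a: str, b: str) -> float:
    """Preference score: positive means a is preferred over b."""
    return float(v[a] @ R_skew @ v[b])

# The cycle paper > rock, scissors > paper, rock > scissors is
# representable: all three scores come out positive (each sqrt(3)/2),
# which no scalar (BT-style) reward assignment can reproduce.
print(score("paper", "rock"))
print(score("scissors", "paper"))
print(score("rock", "scissors"))
```

Each pairwise query costs one inner product over precomputed embeddings, which is where the linear query complexity mentioned in the abstract comes from: scoring one response against K others needs K + 1 embedding computations rather than K pairwise model calls.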