🤖 AI Summary
Existing LLM alignment methods such as RLHF and DPO lack rigorous theoretical foundations and are prone to reward overfitting and deterministic collapse. This work reframes alignment as **learning a target distribution from pairwise preference feedback**, introducing a distribution-learning-based alignment framework that explicitly models how information about the target model leaks into the preference data. Our contributions are threefold: (1) we propose three theoretically grounded, provably non-degenerate objectives: Preference Maximum Likelihood Estimation, Preference Distillation, and Reverse KL Minimization; (2) we establish a non-asymptotic $O(1/n)$ convergence rate for all three objectives; (3) empirical results demonstrate that Preference Distillation consistently matches or outperforms RLHF and DPO across diverse tasks and model architectures. By casting alignment as principled distribution learning, the framework resolves key theoretical and practical limitations of conventional approaches.
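For orientation, the "loss + regularization" objective criticized here is the standard KL-regularized RLHF objective, which DPO reparameterizes into a direct preference loss. The formulations below are the standard ones from the broader literature, shown only for context and not reproduced from this paper:

$$\max_{\pi_\theta}\;\mathbb{E}_{x\sim\mathcal{D}}\Big[\,\mathbb{E}_{y\sim\pi_\theta(\cdot\mid x)}\big[r(x,y)\big]-\beta\,\mathrm{KL}\big(\pi_\theta(\cdot\mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)\Big],$$

$$\mathcal{L}_{\mathrm{DPO}}(\theta)\;=\;-\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}-\beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right].$$

The paper's critique is that, read as loss plus regularization, this objective and its DPO reparameterization incentivize degenerate, deterministic solutions.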
📝 Abstract
Alignment via reinforcement learning from human feedback (RLHF) has become the dominant paradigm for controlling the quality of outputs from large language models (LLMs). However, when viewed as "loss + regularization," the standard RLHF objective lacks theoretical justification and incentivizes degenerate, deterministic solutions, an issue that variants such as Direct Preference Optimization (DPO) also inherit. In this paper, we rethink alignment by framing it as *distribution learning* from pairwise preference feedback, explicitly modeling how information about the target language model bleeds through the preference data. This explicit modeling leads us to propose three principled learning objectives: preference maximum likelihood estimation, preference distillation, and reverse KL minimization. We theoretically show that all three approaches enjoy strong non-asymptotic $O(1/n)$ convergence to the target language model, naturally avoiding degeneracy and reward overfitting. Finally, we empirically demonstrate that our distribution-learning framework, especially preference distillation, consistently matches or outperforms RLHF and DPO across various tasks and models.
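To make the pairwise preference-learning setup concrete, here is a minimal sketch of a preference maximum likelihood loss under an assumed Bradley-Terry model over sequence log-probabilities. The function name and the Bradley-Terry assumption are illustrative choices for this sketch; the abstract does not spell out the paper's exact objective, so this should not be read as its definition.

```python
import torch
import torch.nn.functional as F

def preference_mle_loss(logp_preferred: torch.Tensor,
                        logp_rejected: torch.Tensor) -> torch.Tensor:
    """Illustrative preference MLE loss (hypothetical; not the paper's exact objective).

    Assumes a Bradley-Terry model in which the probability that the preferred
    response y_w beats the rejected response y_l is
        P(y_w > y_l) = sigmoid(log p(y_w | x) - log p(y_l | x)),
    where each log-probability is the summed token log-likelihood under the
    model being trained. Maximizing the likelihood of the observed preferences
    is then equivalent to minimizing the negative log-sigmoid of the margin.
    """
    margin = logp_preferred - logp_rejected   # per-pair log-probability margin
    return -F.logsigmoid(margin).mean()       # scalar negative log-likelihood

# Example usage with dummy sequence log-probabilities:
logp_w = torch.tensor([-12.3, -8.7, -15.1])  # preferred responses
logp_l = torch.tensor([-14.0, -9.5, -14.8])  # rejected responses
print(preference_mle_loss(logp_w, logp_l).item())
```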