Alignment as Distribution Learning: Your Preference Model is Explicitly a Language Model

📅 2025-06-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM alignment methods—such as RLHF and DPO—lack rigorous theoretical foundations, often suffering from overfitting and deterministic collapse. This work reframes alignment as **learning a target distribution from pairwise preference feedback**, introducing the first distribution-learning-based alignment framework that explicitly models information leakage in preference data. The contributions are threefold: (1) three theoretically grounded objectives—Preference Maximum Likelihood Estimation, Preference Distillation, and Reverse KL Minimization—each provably non-degenerate; (2) an $O(1/n)$ non-asymptotic convergence rate for these objectives; (3) empirical results showing that Preference Distillation consistently matches or outperforms RLHF and DPO across diverse tasks and model architectures. By unifying alignment with principled distribution learning, the framework resolves key theoretical and practical limitations of conventional approaches.

📝 Abstract
Alignment via reinforcement learning from human feedback (RLHF) has become the dominant paradigm for controlling the quality of outputs from large language models (LLMs). However, when viewed as "loss + regularization," the standard RLHF objective lacks theoretical justification and incentivizes degenerate, deterministic solutions, an issue that variants such as Direct Preference Optimization (DPO) also inherit. In this paper, we rethink alignment by framing it as *distribution learning* from pairwise preference feedback, explicitly modeling how information about the target language model bleeds through the preference data. This explicit modeling leads us to propose three principled learning objectives: preference maximum likelihood estimation, preference distillation, and reverse KL minimization. We theoretically show that all three approaches enjoy strong non-asymptotic $O(1/n)$ convergence to the target language model, naturally avoiding degeneracy and reward overfitting. Finally, we empirically demonstrate that our distribution learning framework, especially preference distillation, consistently outperforms or matches the performance of RLHF and DPO across various tasks and models.
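To make the pairwise-preference setup concrete, here is a minimal sketch of the standard Bradley-Terry likelihood that such frameworks build on: the probability that a response $y_w$ is preferred over $y_l$ is modeled as a sigmoid of the difference of their sequence log-probabilities, and fitting by maximum likelihood means minimizing the negative log-likelihood over observed pairs. This is a generic illustration, not the paper's specific objectives; the function names (`bt_preference_prob`, `preference_nll`) and the use of raw log-probabilities as scores are assumptions for the sketch.

```python
import math

def bt_preference_prob(logp_w: float, logp_l: float) -> float:
    """Bradley-Terry model: P(y_w preferred over y_l) = sigmoid(logp_w - logp_l)."""
    return 1.0 / (1.0 + math.exp(-(logp_w - logp_l)))

def preference_nll(pairs: list[tuple[float, float]]) -> float:
    """Negative log-likelihood of observed preferences.

    Each pair is (log p(y_w), log p(y_l)) under the model being fit;
    minimizing this over model parameters is the preference-MLE idea.
    """
    return -sum(math.log(bt_preference_prob(lw, ll)) for lw, ll in pairs)

# Example: equal log-probs give a 50/50 preference, contributing log(2) to the NLL.
print(bt_preference_prob(0.0, 0.0))   # 0.5
print(preference_nll([(0.0, 0.0)]))   # log(2) ≈ 0.693
```

A deterministic collapse of the kind the abstract warns about corresponds to driving `logp_w - logp_l` to infinity for every pair, which the paper's distribution-learning objectives are designed to avoid.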
Problem

Research questions and friction points this paper is trying to address.

RLHF lacks theoretical justification and causes degenerate solutions
Alignment is reframed as distribution learning from preference feedback
Proposed objectives avoid degeneracy and outperform RLHF and DPO
Innovation

Methods, ideas, or system contributions that make the work stand out.

Framing alignment as distribution learning
Proposing three principled learning objectives
Avoiding degeneracy and reward overfitting