🤖 AI Summary
Existing LLM alignment methods such as RLHF and DPO lack rigorous theoretical foundations and are prone to reward overfitting and deterministic collapse. This work reframes alignment as **learning a target distribution from pairwise preference feedback**, introducing a distribution-learning-based alignment framework that explicitly models how information about the target model leaks into the preference data. Our contributions are threefold: (1) we propose three theoretically grounded, provably non-degenerate objectives: Preference Maximum Likelihood Estimation, Preference Distillation, and Reverse KL Minimization; (2) we establish a non-asymptotic $O(1/n)$ convergence rate for all three objectives; (3) empirical results demonstrate that Preference Distillation consistently matches or outperforms RLHF and DPO across diverse tasks and model architectures. By casting alignment as principled distribution learning, the framework resolves key theoretical and practical limitations of conventional approaches.
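For orientation, the "loss + regularization" objective criticized here is the standard KL-regularized RLHF objective, which DPO reparameterizes into a direct preference loss. The formulations below are the standard ones from the broader literature, shown only for context and not reproduced from this paper:

$$\max_{\pi_\theta}\;\mathbb{E}_{x\sim\mathcal{D}}\Big[\,\mathbb{E}_{y\sim\pi_\theta(\cdot\mid x)}\big[r(x,y)\big]-\beta\,\mathrm{KL}\big(\pi_\theta(\cdot\mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)\Big],$$

$$\mathcal{L}_{\mathrm{DPO}}(\theta)\;=\;-\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}-\beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right].$$

The paper's critique is that, read as loss plus regularization, this objective and its DPO reparameterization incentivize degenerate, deterministic solutions.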
📝 Abstract
Alignment via reinforcement learning from human feedback (RLHF) has become the dominant paradigm for controlling the quality of outputs from large language models (LLMs). However, when viewed as "loss + regularization," the standard RLHF objective lacks theoretical justification and incentivizes degenerate, deterministic solutions, an issue that variants such as Direct Preference Optimization (DPO) also inherit. In this paper, we rethink alignment by framing it as *distribution learning* from pairwise preference feedback, explicitly modeling how information about the target language model bleeds through the preference data. This explicit modeling leads us to propose three principled learning objectives: preference maximum likelihood estimation, preference distillation, and reverse KL minimization. We theoretically show that all three approaches enjoy strong non-asymptotic $O(1/n)$ convergence to the target language model, naturally avoiding degeneracy and reward overfitting. Finally, we empirically demonstrate that our distribution-learning framework, especially preference distillation, consistently matches or outperforms RLHF and DPO across various tasks and models.
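To make the pairwise preference-learning setup concrete, here is a minimal sketch of a preference maximum likelihood loss under an assumed Bradley-Terry model over sequence log-probabilities. The function name and the Bradley-Terry assumption are illustrative choices for this sketch; the abstract does not spell out the paper's exact objective, so this should not be read as its definition.

```python
import torch
import torch.nn.functional as F

def preference_mle_loss(logp_preferred: torch.Tensor,
                        logp_rejected: torch.Tensor) -> torch.Tensor:
    """Illustrative preference MLE loss (hypothetical; not the paper's exact objective).

    Assumes a Bradley-Terry model in which the probability that the preferred
    response y_w beats the rejected response y_l is
        P(y_w > y_l) = sigmoid(log p(y_w | x) - log p(y_l | x)),
    where each log-probability is the summed token log-likelihood under the
    model being trained. Maximizing the likelihood of the observed preferences
    is then equivalent to minimizing the negative log-sigmoid of the margin.
    """
    margin = logp_preferred - logp_rejected   # per-pair log-probability margin
    return -F.logsigmoid(margin).mean()       # scalar negative log-likelihood

# Example usage with dummy sequence log-probabilities:
logp_w = torch.tensor([-12.3, -8.7, -15.1])  # preferred responses
logp_l = torch.tensor([-14.0, -9.5, -14.8])  # rejected responses
print(preference_mle_loss(logp_w, logp_l).item())
```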