Direct Preference Optimization with Unobserved Preference Heterogeneity: The Necessity of Ternary Preferences

📅 2025-10-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing preference learning methods such as RLHF and DPO rely solely on binary comparisons and neglect the inherent diversity of human preferences, leading to suboptimal alignment for heterogeneous user populations. Method: We propose a generalized preference learning framework based on ternary (or higher-order) rankings. Theoretically, we bring econometric identification theory into preference modeling, proving that rankings over three or more responses identify latent heterogeneous preferences in the infinite-user limit, where binary comparisons do not. Algorithmically, we extend DPO to user-type-aware personalized optimization and design a fairness-aware aggregation algorithm grounded in min-max regret. For implementation, we combine EM-based discovery of preference types with DPO training of a per-type policy mixture under explicit fairness constraints. Contribution/Results: Our framework ensures identifiability and group fairness even with limited data per user, improving alignment across diverse user groups while preserving theoretical rigor and practical scalability.
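A minimal sketch of what such an EM-style adaptation of DPO could look like, on a toy discrete task: the E-step assigns each annotator a posterior over latent types, and the M-step takes a gradient step on a responsibility-weighted DPO loss for each type's policy. Everything here (the logit-table "policies" over N canned responses, the synthetic annotators, all hyperparameters) is an illustrative assumption, not the paper's implementation.

```python
# Toy EM-DPO sketch: K latent annotator types, each with its own "policy"
# represented as a logit table over N candidate responses (illustrative only;
# real DPO would use LLM sequence log-probs, not logit tables).
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
K, N, BETA = 2, 6, 1.0                      # latent types, candidate responses, DPO temperature
ref_logp = torch.full((N,), -math.log(N))   # uniform reference policy log-probs

def sample_annotator(n_pairs=8):
    """Synthetic annotator: type 0 prefers low-index responses, type 1 the reverse."""
    t = torch.randint(0, K, ()).item()
    pairs = []
    for _ in range(n_pairs):
        i, j = torch.randperm(N)[:2].tolist()
        pairs.append((min(i, j), max(i, j)) if t == 0 else (max(i, j), min(i, j)))
    return pairs

annotators = [sample_annotator() for _ in range(60)]

policies = [torch.zeros(N, requires_grad=True) for _ in range(K)]  # per-type logits
mix = torch.full((K,), 1.0 / K)                                    # mixture weights
opt = torch.optim.Adam(policies, lr=0.1)

def pair_loglik(logits, chosen, rejected):
    # DPO log-likelihood of one comparison under one type's policy:
    # log sigmoid(beta * [(log pi - log ref)(chosen) - (log pi - log ref)(rejected)])
    logp = F.log_softmax(logits, dim=-1)
    margin = (logp[chosen] - ref_logp[chosen]) - (logp[rejected] - ref_logp[rejected])
    return F.logsigmoid(BETA * margin)

for step in range(30):
    # E-step: posterior responsibility of each type for each annotator's comparisons.
    with torch.no_grad():
        ll = torch.stack([
            torch.stack([sum(pair_loglik(policies[k], c, r) for c, r in pairs)
                         for k in range(K)])
            for pairs in annotators])                    # shape (annotators, K)
        resp = F.softmax(ll + mix.log(), dim=-1)
    # M-step: one gradient step on the responsibility-weighted DPO loss per type.
    opt.zero_grad()
    loss = -sum(resp[a, k] * pair_loglik(policies[k], c, r)
                for a, pairs in enumerate(annotators)
                for k in range(K) for c, r in pairs) / len(annotators)
    loss.backward()
    opt.step()
    mix = resp.mean(dim=0)                               # update mixture weights
```

In the paper's setting the per-type policies would be LLMs fine-tuned with DPO and the log-likelihoods would come from sequence log-probabilities; the EM alternation above is only the structure the summary describes.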

📝 Abstract
Reinforcement Learning from Human Feedback (RLHF) has become central to aligning large language models with human values, typically by first learning a reward model from preference data, which is then used to update the model with reinforcement learning. Recent alternatives such as Direct Preference Optimization (DPO) simplify this pipeline by optimizing directly on preferences. However, both approaches often assume uniform annotator preferences and rely on binary comparisons, overlooking two key issues: the diversity of human evaluators and the limits of pairwise feedback. In this work, we address both. First, we connect preference learning in RLHF with the econometrics literature and show that binary comparisons are insufficient for identifying latent user preferences from finite data per user, even in the infinite-user limit, while (even incomplete) rankings over three or more responses ensure identifiability. Second, we introduce methods to incorporate heterogeneous preferences into alignment algorithms. We develop an Expectation-Maximization adaptation of DPO that discovers latent annotator types and trains a mixture of LLMs accordingly. Then we propose an aggregation algorithm using a min-max regret fairness criterion to produce a single generative policy with equitable performance guarantees. Together, these contributions establish a theoretical and algorithmic framework for fairness and personalization in generative model alignment for diverse users.
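As background on the binary-vs-ternary distinction, the standard likelihoods involved are Bradley-Terry for pairwise comparisons and Plackett-Luce for rankings; the sketch below states them in generic notation (assumed here, not copied from the paper):

```latex
% Bradley-Terry (pairwise) likelihood for a latent reward r(x, y):
\[
  P(y_1 \succ y_2 \mid x) = \frac{e^{r(x,y_1)}}{e^{r(x,y_1)} + e^{r(x,y_2)}}
\]
% Plackett-Luce likelihood of a ranking y_{(1)} \succ \dots \succ y_{(m)};
% m >= 3 covers the ternary case:
\[
  P\bigl(y_{(1)} \succ \dots \succ y_{(m)} \mid x\bigr)
  = \prod_{i=1}^{m-1} \frac{e^{r(x,y_{(i)})}}{\sum_{j=i}^{m} e^{r(x,y_{(j)})}}
\]
% With unobserved heterogeneity, observations come from a mixture over latent
% types k with weights \alpha_k and rewards r_k; the abstract's claim is that
% this mixture is identified from rankings with m >= 3 but not from pairs.
```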
Problem

Research questions and friction points this paper is trying to address.

Addressing limitations of binary preference comparisons in RLHF
Incorporating heterogeneous human preferences into alignment algorithms
Developing theoretical framework for fair generative model personalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses ternary preferences for identifiability
Adapts DPO with EM for latent types
Aggregates policies with min-max regret fairness (see the sketch after this list)
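A minimal sketch of how a min-max regret criterion can be posed as a linear program over mixture weights, assuming per-type utilities of each candidate policy have already been estimated; the utility matrix and all names are hypothetical, and the paper's actual algorithm outputs a single generative policy rather than this LP's randomization over candidates.

```python
# Min-max regret aggregation as an LP: choose mixture weights w over J candidate
# policies to minimize the worst regret across K user types.
# U[k, j] = (assumed, pre-estimated) utility of policy j for user type k.
import numpy as np
from scipy.optimize import linprog

U = np.array([[1.0, 0.2, 0.6],    # toy utilities: rows = user types,
              [0.1, 0.9, 0.6]])   # columns = candidate policies
K, J = U.shape
best = U.max(axis=1)              # each type's best achievable utility

# Variables x = (w_1..w_J, t); minimize t subject to
#   best_k - U_k . w <= t   (regret of type k under mixture w)
#   sum_j w_j = 1,  w >= 0
c = np.r_[np.zeros(J), 1.0]
A_ub = np.hstack([-U, -np.ones((K, 1))])
b_ub = -best
A_eq = np.r_[np.ones(J), 0.0].reshape(1, -1)
b_eq = [1.0]
bounds = [(0, 1)] * J + [(None, None)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
w, t = res.x[:J], res.x[J]
print("mixture weights:", w.round(3), "worst-case regret:", round(t, 3))
```

The design choice the criterion encodes: rather than maximizing average utility (which can sacrifice a minority type entirely), it bounds how far any single user type falls below its own best achievable outcome.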
👥 Authors
Keertana Chidambaram
Stanford University
Karthik Vinary Seetharaman
Stanford University
Vasilis Syrgkanis
Assistant Professor, Stanford University
Machine Learning · Causal Inference · Econometrics · Game Theory · Mechanism Design