Direct Preference Optimization with Unobserved Preference Heterogeneity: The Necessity of Ternary Preferences

📅 2025-10-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing preference learning methods such as RLHF and DPO rely solely on binary comparisons and neglect the inherent diversity of human preferences, leading to suboptimal alignment for heterogeneous user populations. Method: We propose a generalized preference learning framework based on ternary (or higher-order) rankings. Theoretically, we bring econometric identification theory into preference modeling, proving that rankings over three or more responses identify latent heterogeneous preferences in the infinite-user limit, where binary comparisons do not. Algorithmically, we extend DPO to user-type-aware personalized optimization and design a fairness-aware aggregation algorithm grounded in min-max regret. For implementation, we combine EM-based discovery of preference types with DPO training of a per-type policy mixture under explicit fairness constraints. Contribution/Results: Our framework ensures identifiability and group fairness even with limited data per user, improving alignment across diverse user groups while preserving theoretical rigor and practical scalability.
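A minimal sketch of what such an EM-style adaptation of DPO could look like, on a toy discrete task: the E-step assigns each annotator a posterior over latent types, and the M-step takes a gradient step on a responsibility-weighted DPO loss for each type's policy. Everything here (the logit-table "policies" over N canned responses, the synthetic annotators, all hyperparameters) is an illustrative assumption, not the paper's implementation.

```python
# Toy EM-DPO sketch: K latent annotator types, each with its own "policy"
# represented as a logit table over N candidate responses (illustrative only;
# real DPO would use LLM sequence log-probs, not logit tables).
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
K, N, BETA = 2, 6, 1.0                      # latent types, candidate responses, DPO temperature
ref_logp = torch.full((N,), -math.log(N))   # uniform reference policy log-probs

def sample_annotator(n_pairs=8):
    """Synthetic annotator: type 0 prefers low-index responses, type 1 the reverse."""
    t = torch.randint(0, K, ()).item()
    pairs = []
    for _ in range(n_pairs):
        i, j = torch.randperm(N)[:2].tolist()
        pairs.append((min(i, j), max(i, j)) if t == 0 else (max(i, j), min(i, j)))
    return pairs

annotators = [sample_annotator() for _ in range(60)]

policies = [torch.zeros(N, requires_grad=True) for _ in range(K)]  # per-type logits
mix = torch.full((K,), 1.0 / K)                                    # mixture weights
opt = torch.optim.Adam(policies, lr=0.1)

def pair_loglik(logits, chosen, rejected):
    # DPO log-likelihood of one comparison under one type's policy:
    # log sigmoid(beta * [(log pi - log ref)(chosen) - (log pi - log ref)(rejected)])
    logp = F.log_softmax(logits, dim=-1)
    margin = (logp[chosen] - ref_logp[chosen]) - (logp[rejected] - ref_logp[rejected])
    return F.logsigmoid(BETA * margin)

for step in range(30):
    # E-step: posterior responsibility of each type for each annotator's comparisons.
    with torch.no_grad():
        ll = torch.stack([
            torch.stack([sum(pair_loglik(policies[k], c, r) for c, r in pairs)
                         for k in range(K)])
            for pairs in annotators])                    # shape (annotators, K)
        resp = F.softmax(ll + mix.log(), dim=-1)
    # M-step: one gradient step on the responsibility-weighted DPO loss per type.
    opt.zero_grad()
    loss = -sum(resp[a, k] * pair_loglik(policies[k], c, r)
                for a, pairs in enumerate(annotators)
                for k in range(K) for c, r in pairs) / len(annotators)
    loss.backward()
    opt.step()
    mix = resp.mean(dim=0)                               # update mixture weights
```

In the paper's setting the per-type policies would be LLMs fine-tuned with DPO and the log-likelihoods would come from sequence log-probabilities; the EM alternation above is only the structure the summary describes.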

📝 Abstract
Reinforcement Learning from Human Feedback (RLHF) has become central to aligning large language models with human values, typically by first learning a reward model from preference data, which is then used to update the model with reinforcement learning. Recent alternatives such as Direct Preference Optimization (DPO) simplify this pipeline by optimizing directly on preferences. However, both approaches often assume uniform annotator preferences and rely on binary comparisons, overlooking two key issues: the diversity of human evaluators and the limits of pairwise feedback. In this work, we address both. First, we connect preference learning in RLHF with the econometrics literature and show that binary comparisons are insufficient for identifying latent user preferences from finite data per user, even in the infinite-user limit, while (even incomplete) rankings over three or more responses ensure identifiability. Second, we introduce methods to incorporate heterogeneous preferences into alignment algorithms. We develop an Expectation-Maximization adaptation of DPO that discovers latent annotator types and trains a mixture of LLMs accordingly. Then we propose an aggregation algorithm using a min-max regret fairness criterion to produce a single generative policy with equitable performance guarantees. Together, these contributions establish a theoretical and algorithmic framework for fairness and personalization in generative model alignment for diverse users.
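As background on the binary-vs-ternary distinction, the standard likelihoods involved are Bradley-Terry for pairwise comparisons and Plackett-Luce for rankings; the sketch below states them in generic notation (assumed here, not copied from the paper):

```latex
% Bradley-Terry (pairwise) likelihood for a latent reward r(x, y):
\[
  P(y_1 \succ y_2 \mid x) = \frac{e^{r(x,y_1)}}{e^{r(x,y_1)} + e^{r(x,y_2)}}
\]
% Plackett-Luce likelihood of a ranking y_{(1)} \succ \dots \succ y_{(m)};
% m >= 3 covers the ternary case:
\[
  P\bigl(y_{(1)} \succ \dots \succ y_{(m)} \mid x\bigr)
  = \prod_{i=1}^{m-1} \frac{e^{r(x,y_{(i)})}}{\sum_{j=i}^{m} e^{r(x,y_{(j)})}}
\]
% With unobserved heterogeneity, observations come from a mixture over latent
% types k with weights \alpha_k and rewards r_k; the abstract's claim is that
% this mixture is identified from rankings with m >= 3 but not from pairs.
```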
Problem

Research questions and friction points this paper is trying to address.

Addressing limitations of binary preference comparisons in RLHF
Incorporating heterogeneous human preferences into alignment algorithms
Developing theoretical framework for fair generative model personalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses ternary preferences for identifiability
Adapts DPO with EM for latent types
Aggregates policies with min-max regret fairness (see the sketch after this list)
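A minimal sketch of how a min-max regret criterion can be posed as a linear program over mixture weights, assuming per-type utilities of each candidate policy have already been estimated; the utility matrix and all names are hypothetical, and the paper's actual algorithm outputs a single generative policy rather than this LP's randomization over candidates.

```python
# Min-max regret aggregation as an LP: choose mixture weights w over J candidate
# policies to minimize the worst regret across K user types.
# U[k, j] = (assumed, pre-estimated) utility of policy j for user type k.
import numpy as np
from scipy.optimize import linprog

U = np.array([[1.0, 0.2, 0.6],    # toy utilities: rows = user types,
              [0.1, 0.9, 0.6]])   # columns = candidate policies
K, J = U.shape
best = U.max(axis=1)              # each type's best achievable utility

# Variables x = (w_1..w_J, t); minimize t subject to
#   best_k - U_k . w <= t   (regret of type k under mixture w)
#   sum_j w_j = 1,  w >= 0
c = np.r_[np.zeros(J), 1.0]
A_ub = np.hstack([-U, -np.ones((K, 1))])
b_ub = -best
A_eq = np.r_[np.ones(J), 0.0].reshape(1, -1)
b_eq = [1.0]
bounds = [(0, 1)] * J + [(None, None)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
w, t = res.x[:J], res.x[J]
print("mixture weights:", w.round(3), "worst-case regret:", round(t, 3))
```

The design choice the criterion encodes: rather than maximizing average utility (which can sacrifice a minority type entirely), it bounds how far any single user type falls below its own best achievable outcome.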
👥 Authors
Keertana Chidambaram
Stanford University
Karthik Vinary Seetharaman
Stanford University
Vasilis Syrgkanis
Assistant Professor, Stanford University
Machine Learning · Causal Inference · Econometrics · Game Theory · Mechanism Design