Direct Preference Optimization With Unobserved Preference Heterogeneity

📅 2024-05-23
🏛️ arXiv.org
📈 Citations: 11
Influential: 0
📄 PDF

career value

191K/year
🤖 AI Summary
Existing RLHF and DPO methods assume homogeneous human preferences and rely solely on binary comparisons, failing to capture the heterogeneity and long-tailed distribution of annotator preferences in realistic settings. To address this, we propose the first direct preference optimization framework explicitly designed for preference heterogeneity. Our method (1) employs an EM variant to jointly infer latent preference types and model parameters; (2) adopts a mixture-of-experts architecture to model the multi-annotator preference distribution; and (3) introduces a min-max regret ensemble mechanism to enhance policy robustness—particularly with respect to subgroup fairness and worst-case performance. Evaluated across multiple benchmarks exhibiting strong preference heterogeneity, our approach significantly outperforms standard DPO and RLHF: it achieves an average 23.6% improvement in fairness metrics and an 18.4% gain in worst-case win rate.

Technology Category

Application Category

📝 Abstract
RLHF has emerged as a pivotal step in aligning language models with human objectives and values. It typically involves learning a reward model from human preference data and then using reinforcement learning to update the generative model accordingly. Conversely, Direct Preference Optimization (DPO) directly optimizes the generative model with preference data, skipping reinforcement learning. However, both RLHF and DPO assume uniform preferences, overlooking the reality of diverse human annotators. This paper presents a new method to align generative models with varied human preferences. We propose an Expectation-Maximization adaptation to DPO, generating a mixture of models based on latent preference types of the annotators. We then introduce a min-max regret ensemble learning model to produce a single generative method to minimize worst-case regret among annotator subgroups with similar latent factors. Our algorithms leverage the simplicity of DPO while accommodating diverse preferences. Experimental results validate the effectiveness of our approach in producing equitable generative policies.
Problem

Research questions and friction points this paper is trying to address.

Addressing limitations of binary preference comparisons in language model alignment
Incorporating diverse human preferences into model training algorithms
Developing fair generative policies for heterogeneous user populations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses ternary preferences to ensure latent preference identifiability
Introduces EM-based DPO adaptation for heterogeneous annotator modeling
Proposes min-max regret aggregation for equitable policy guarantees