🤖 AI Summary
This work addresses the limitation of existing preference alignment methods, which typically assume homogeneous preference strength and thus fail to capture the heterogeneity arising from individual and contextual differences in real-world scenarios. To overcome this, we propose MixDPO, the first approach to integrate the mixed logit model from behavioral economics into language model alignment. By extending Direct Preference Optimization, MixDPO explicitly models the distributional variation of preference strength. Grounded in discrete choice theory and deep learning, our method learns the parameters of preference strength distributions directly from preference data, enabling the alignment objective to distinguish between strong and weak preferences. Experiments across three datasets and two open-source models demonstrate that MixDPO improves average alignment performance by 11.2 points on Pythia-2.8B, with particularly pronounced gains in high-heterogeneity settings, while effectively preserving subgroup-specific preferences.
📝 Abstract
Preference-based alignment objectives implicitly assume that all human preferences are expressed with equal strength. In practice, however, preference strength varies across individuals and contexts -- a phenomenon established in behavioral economics and discrete choice theory. This mismatch limits the ability of existing objectives to faithfully capture heterogeneous human judgments. Inspired by this literature, we introduce Mixed Logit Direct Preference Optimization (MixDPO), a generalization of Direct Preference Optimization that models variation in preference strength. MixDPO enables alignment objectives to capture heterogeneity in how strongly preferences are expressed across training examples. We evaluate MixDPO on three preference datasets using two open-weight language models. Across datasets, MixDPO improves aggregate alignment performance (+11.2 points on Pythia-2.8B) while preserving subgroup-level preferences, with the largest gains appearing in settings with higher inferred preference heterogeneity. MixDPO makes preference heterogeneity explicit through learned strength distributions. We release our code for reproducibility.
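To make the idea concrete, here is a minimal sketch of how a mixed-logit extension of the DPO objective could look. Standard DPO uses a single fixed preference-strength coefficient beta; in a mixed logit treatment, beta is instead drawn from a learned distribution and the per-example likelihood is the expectation of the Bradley-Terry choice probability over that distribution. The function name, the log-normal parameterization, and the Monte Carlo estimator below are illustrative assumptions for exposition, not the paper's exact formulation.

```python
import torch

def mixdpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                mu, log_sigma, n_samples=64):
    """Monte Carlo sketch of a mixed-logit DPO objective.

    beta is sampled from a learned LogNormal(mu, sigma) rather than fixed
    as in standard DPO, so the objective averages the Bradley-Terry choice
    probability over a distribution of preference strengths.
    (Hypothetical parameterization, chosen for illustration only.)
    """
    # DPO log-ratio margin between chosen (w) and rejected (l) responses.
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)  # shape: (batch,)

    # Draw positive preference-strength samples via the reparameterization
    # trick, so mu and log_sigma remain differentiable.
    sigma = log_sigma.exp()
    eps = torch.randn(n_samples, 1)
    beta = torch.exp(mu + sigma * eps)                       # (n_samples, 1)

    # Expected choice probability under the beta distribution.
    p = torch.sigmoid(beta * margin.unsqueeze(0)).mean(dim=0)  # (batch,)
    return -torch.log(p + 1e-8).mean()
```

With a zero margin the expected choice probability is exactly 0.5 regardless of beta, recovering a loss of log 2; a larger positive margin for the preferred response lowers the loss, as in standard DPO, while mu and log_sigma let training adapt how concentrated or dispersed preference strength is.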