Learning from Noisy Preferences: A Semi-Supervised Learning Approach to Direct Preference Optimization

📅 2026-04-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a critical limitation of existing preference optimization methods for diffusion models: they compress multidimensional human preferences into binary labels, which often leads to conflicting gradients and severe label noise. To mitigate this issue, the paper proposes Semi-DPO, the first approach to integrate semi-supervised learning into Direct Preference Optimization (DPO). Semi-DPO uses consensus filtering to identify consensus-aligned preference pairs, which it treats as clean labeled data, and treats conflicting pairs as unlabeled noisy samples. After an initial training phase on the clean data, the model generates pseudo-labels for the noisy samples and is refined iteratively. Notably, this method requires neither additional human annotations nor an explicit reward model. Empirical results demonstrate that Semi-DPO effectively alleviates noise arising from multidimensional preferences and achieves state-of-the-art performance on complex human preference alignment tasks.
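
The clean/noisy split in the summary can be illustrated with a small sketch. The summary does not say how consensus is measured, so everything specific here is an assumption made for illustration: the `PreferencePair` type, the per-dimension score differences in `dim_deltas`, and the rule that a pair counts as clean only when every dimension agrees with the binary label. It is not the paper's filtering procedure.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class PreferencePair:
    """Hypothetical record for one annotated (winner, loser) image pair."""
    pair_id: str
    # Assumed input: winner score minus loser score, per preference dimension.
    dim_deltas: Dict[str, float]


def consensus_split(
    pairs: List[PreferencePair], margin: float = 0.0
) -> Tuple[List[PreferencePair], List[PreferencePair]]:
    """Split pairs into (clean, noisy).

    A pair is clean only if it has at least one dimension and the labeled
    winner beats the loser by more than `margin` on every dimension.
    Any disagreement sends the pair to the noisy (unlabeled) set.
    """
    clean: List[PreferencePair] = []
    noisy: List[PreferencePair] = []
    for pair in pairs:
        deltas = list(pair.dim_deltas.values())
        if deltas and all(delta > margin for delta in deltas):
            clean.append(pair)
        else:
            noisy.append(pair)
    return clean, noisy
```

A pair whose winner is better on aesthetics but worse on semantic alignment would land in the noisy set and have its label discarded, which matches the summary's treatment of conflicting pairs as unlabeled samples.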

📝 Abstract

Human visual preferences are inherently multi-dimensional, encompassing aesthetics, detail fidelity, and semantic alignment. However, existing datasets provide only single, holistic annotations, resulting in severe label noise: images that excel in some dimensions but are deficient in others are simply marked as winner or loser. We theoretically demonstrate that compressing multi-dimensional preferences into binary labels generates conflicting gradient signals that misguide Diffusion Direct Preference Optimization (DPO). To address this, we propose Semi-DPO, a semi-supervised approach that treats consistent pairs as clean labeled data and conflicting ones as noisy unlabeled data. Our method starts by training on a consensus-filtered clean subset, then uses this model as an implicit classifier to generate pseudo-labels for the noisy set for iterative refinement. Experimental results demonstrate that Semi-DPO achieves state-of-the-art performance and significantly improves alignment with complex human preferences, without requiring additional human annotation or explicit reward models during training. We will release our code and models at: https://github.com/L-CodingSpace/semi-dpo
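
The "implicit classifier" step can be sketched under the standard DPO formulation, in which the implicit reward of a sample is `beta * (log pi_theta - log pi_ref)` and the preference probability is a sigmoid of the reward difference. The abstract does not give the exact scoring or selection rule, so the function names, the `threshold` cutoff, and the choice to leave low-confidence pairs unlabeled are assumptions. For a diffusion model the log-ratios would have to be approximated, for example from denoising errors, which this sketch takes as precomputed inputs.

```python
import math
from typing import Optional


def implicit_preference(logratio_a: float, logratio_b: float, beta: float = 1.0) -> float:
    """Probability that image a is preferred over image b.

    Each logratio is log pi_theta(x) - log pi_ref(x) for one image
    (assumed to be precomputed by the caller).
    """
    z = beta * (logratio_a - logratio_b)
    # Numerically stable sigmoid: avoids overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def pseudo_label(
    logratio_a: float,
    logratio_b: float,
    beta: float = 1.0,
    threshold: float = 0.8,  # hypothetical confidence cutoff
) -> Optional[str]:
    """Return "a" or "b" as the pseudo-winner, or None if the model is unsure."""
    p = implicit_preference(logratio_a, logratio_b, beta)
    if p >= threshold:
        return "a"
    if p <= 1.0 - threshold:
        return "b"
    return None
```

Pairs that come back as `None` would stay unlabeled for the current round and could be rescored after the next refinement pass, which is one plausible way to read "iterative refinement" here.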

Problem

Research questions and friction points this paper is trying to address.

noisy preferences

multi-dimensional preferences

label noise

Direct Preference Optimization

diffusion models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Semi-Supervised Learning

Direct Preference Optimization

Label Noise