🤖 AI Summary
DPO-based algorithms suffer from ill-posed preference modeling, which can lead to probability collapse in the win/loss response probabilities and thereby degrade alignment stability and generalization. This work establishes an implicit classification theoretical framework for DPO, revealing it as an implicit binary classification problem under severe class imbalance, with probability collapse arising from unconstrained target distributions. To address this, the authors propose a controllable probability mass transfer constraint that explicitly regulates the allocation of probability mass between win and loss responses. They additionally design a preference loss that retains an RLHF interpretation, together with a regularizer on the divergence between the reference and target policies. Evaluated on multiple standard preference datasets, the method outperforms vanilla DPO: it mitigates probability collapse and improves the stability, generalization, and training robustness of LLM alignment.
📝 Abstract
Direct preference optimization (DPO)-style algorithms have emerged as a promising approach for solving the alignment problem in AI. We present a novel perspective that formulates these algorithms as implicit classification algorithms. This classification framework enables us to recover many variants of DPO-style algorithms by choosing appropriate classification labels and loss functions. We then leverage this classification framework to demonstrate that the underlying problem solved in these algorithms is under-specified, making them susceptible to probability collapse of the winner-loser responses. We address this by proposing a set of constraints designed to control the movement of probability mass between the winner and loser in the reference and target policies. Our resulting algorithm, which we call Constrained Controlled Classification DPO (`C-3DPO`), has a meaningful RLHF interpretation. By hedging against probability collapse, `C-3DPO` provides practical improvements over vanilla `DPO` when aligning several large language models using standard preference datasets.
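To make the probability-collapse failure mode concrete, here is a minimal sketch. The `dpo_loss` function is the standard DPO objective on per-response log-probabilities; `constrained_dpo_loss` adds a hinge penalty when the target policy moves probability mass away from the winner. The hinge form and the `lam` weight are illustrative assumptions, not the paper's exact `C-3DPO` constraint.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Vanilla DPO: negative log-sigmoid of the beta-scaled implicit
    reward margin between the winner and loser responses."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(sigmoid(margin))

def constrained_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                         beta=0.1, lam=1.0):
    """Illustrative constrained variant: the hinge term fires only when
    the winner's log-probability drops below its reference value, i.e.
    when probability mass is leaving the winner -- the failure mode
    behind probability collapse. (lam and the hinge are assumptions
    for illustration, not the paper's exact constraint.)"""
    base = dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta)
    collapse_penalty = max(0.0, ref_logp_w - logp_w)
    return base + lam * collapse_penalty

# Two updates with the SAME reward margin (so identical DPO loss):
# "healthy": winner probability increases relative to the reference.
healthy = dict(logp_w=-1.5, logp_l=-3.5, ref_logp_w=-2.0, ref_logp_l=-3.0)
# "collapsed": both winner and loser probabilities fall, margin unchanged.
collapsed = dict(logp_w=-4.0, logp_l=-6.0, ref_logp_w=-2.0, ref_logp_l=-3.0)

print(dpo_loss(**healthy), dpo_loss(**collapsed))            # equal losses
print(constrained_dpo_loss(**healthy),
      constrained_dpo_loss(**collapsed))                      # penalty separates them
```

The key point of the sketch: vanilla DPO cannot distinguish the two cases because it only sees the margin, whereas the constrained loss penalizes the collapsed update.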