🤖 AI Summary
DPO-based algorithms suffer from ill-posed preference modeling, which can lead to probability collapse in the win/loss response probabilities and thereby degrade alignment stability and generalization. This work establishes an implicit classification theoretical framework for DPO, revealing it as an implicit binary classification problem under severe class imbalance, with probability collapse arising from unconstrained target distributions. To address this, the authors propose a controllable probability mass transfer constraint that explicitly regulates the allocation of probability mass between win and loss responses. They additionally design a preference loss that retains an RLHF interpretation, together with a regularizer on the divergence between the reference and target policies. Evaluated on multiple standard preference datasets, the method outperforms vanilla DPO: it mitigates probability collapse and improves the stability, generalization, and training robustness of LLM alignment.
📝 Abstract
Direct preference optimization (DPO)-style algorithms have emerged as a promising approach for solving the alignment problem in AI. We present a novel perspective that formulates these algorithms as implicit classification algorithms. This classification framework enables us to recover many variants of DPO-style algorithms by choosing appropriate classification labels and loss functions. We then leverage this classification framework to demonstrate that the underlying problem solved in these algorithms is under-specified, making them susceptible to probability collapse of the winner-loser responses. We address this by proposing a set of constraints designed to control the movement of probability mass between the winner and loser in the reference and target policies. Our resulting algorithm, which we call Constrained Controlled Classification DPO (`C-3DPO`), has a meaningful RLHF interpretation. By hedging against probability collapse, `C-3DPO` provides practical improvements over vanilla `DPO` when aligning several large language models using standard preference datasets.
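To make the probability-collapse failure mode concrete, here is a minimal sketch. The `dpo_loss` function is the standard DPO objective on per-response log-probabilities; `constrained_dpo_loss` adds a hinge penalty when the target policy moves probability mass away from the winner. The hinge form and the `lam` weight are illustrative assumptions, not the paper's exact `C-3DPO` constraint.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Vanilla DPO: negative log-sigmoid of the beta-scaled implicit
    reward margin between the winner and loser responses."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(sigmoid(margin))

def constrained_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                         beta=0.1, lam=1.0):
    """Illustrative constrained variant: the hinge term fires only when
    the winner's log-probability drops below its reference value, i.e.
    when probability mass is leaving the winner -- the failure mode
    behind probability collapse. (lam and the hinge are assumptions
    for illustration, not the paper's exact constraint.)"""
    base = dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta)
    collapse_penalty = max(0.0, ref_logp_w - logp_w)
    return base + lam * collapse_penalty

# Two updates with the SAME reward margin (so identical DPO loss):
# "healthy": winner probability increases relative to the reference.
healthy = dict(logp_w=-1.5, logp_l=-3.5, ref_logp_w=-2.0, ref_logp_l=-3.0)
# "collapsed": both winner and loser probabilities fall, margin unchanged.
collapsed = dict(logp_w=-4.0, logp_l=-6.0, ref_logp_w=-2.0, ref_logp_l=-3.0)

print(dpo_loss(**healthy), dpo_loss(**collapsed))            # equal losses
print(constrained_dpo_loss(**healthy),
      constrained_dpo_loss(**collapsed))                      # penalty separates them
```

The key point of the sketch: vanilla DPO cannot distinguish the two cases because it only sees the margin, whereas the constrained loss penalizes the collapsed update.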