AI Summary
This work establishes that the equivalence between Direct Preference Optimization (DPO) and Reinforcement Learning from Human Feedback (RLHF) hinges on a commonly violated implicit assumption: that the optimal RLHF policy must strictly prefer human-preferred responses. When this assumption fails, DPO merely optimizes relative advantages over a reference policy, potentially leading to pathological convergence rather than genuine alignment with human preferences. To address this, we propose Constrained Preference Optimization (CPO), a framework that retains simplicity while offering provable alignment guarantees. Through theoretical analysis, geometric interpretation via soft-margin ranking, constrained optimization, and large-scale experiments, we demonstrate that CPO achieves state-of-the-art performance on standard benchmarks. Our work also formally characterizes the conditions under which DPO and RLHF are equivalent, clarifying both the validity regime and failure modes of DPO.
Abstract
Direct Preference Optimization (DPO) has emerged as a popular alternative to Reinforcement Learning from Human Feedback (RLHF), offering theoretical equivalence with simpler implementation. We prove this equivalence is conditional rather than universal, depending on an implicit assumption frequently violated in practice: the RLHF-optimal policy must prefer human-preferred responses. When this assumption fails, DPO optimizes relative advantage over the reference policy rather than absolute alignment with human preferences, leading to pathological convergence where policies decrease DPO loss while preferring dispreferred responses. We characterize when this assumption is violated, show the existence of an undesirable solution space, and prove that DPO and RLHF optimize fundamentally different objectives in such cases. To address this, we introduce Constrained Preference Optimization (CPO), augmenting RLHF with constraints for provable alignment. We further provide a geometric interpretation through soft-margin ranking, revealing that DPO implements margin ranking with potentially negative targets. Our theoretical analysis establishes when DPO's guarantees hold and provides solutions preserving simplicity with provable alignment. Comprehensive experiments on standard benchmarks demonstrate that CPO achieves state-of-the-art performance. Code is available at: https://github.com/visitworld123/CPO.
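The failure mode described above, in which a policy lowers the DPO loss while still ranking the dispreferred response higher, can be illustrated with a minimal numeric sketch. It assumes the standard per-pair DPO loss, -log σ(β · margin), where the margin is the policy's log-ratio gain over the reference on the preferred response minus its gain on the dispreferred one. The log-probabilities and the helper name `dpo_loss` are hypothetical values chosen for illustration; they are not taken from the paper or its repository.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=1.0):
    """Standard per-pair DPO loss: -log sigmoid(beta * margin).

    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l), i.e. the
    improvement over the reference on the preferred response (w) minus
    the improvement on the dispreferred response (l).
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    # -log(sigmoid(x)) == log(1 + exp(-x))
    return math.log1p(math.exp(-beta * margin))


# Hypothetical log-probabilities: the reference strongly prefers the
# dispreferred response (reference gap = -5.0 - (-1.0) = -4.0).
ref_logp_w, ref_logp_l = -5.0, -1.0

# Hypothetical trained policy: it has moved toward the preferred response,
# but still assigns it less probability than the dispreferred one.
pol_logp_w, pol_logp_l = -3.0, -1.5

# Loss at initialization (policy == reference): margin is 0, loss is log 2.
loss_ref = dpo_loss(ref_logp_w, ref_logp_l, ref_logp_w, ref_logp_l)

# Loss of the trained policy: margin is 2.0 + 0.5 = 2.5, so the loss drops.
loss_pol = dpo_loss(pol_logp_w, pol_logp_l, ref_logp_w, ref_logp_l)

# The policy's own ranking of the two responses.
still_prefers_dispreferred = pol_logp_l > pol_logp_w

print(f"loss at reference: {loss_ref:.4f}")
print(f"loss after update: {loss_pol:.4f}")
print(f"policy still prefers dispreferred response: {still_prefers_dispreferred}")
```

In this toy setting the loss only requires the policy's gap, log π(y_w) − log π(y_l), to exceed the reference gap of −4. That is one way to read the abstract's remark about margin ranking with potentially negative targets: a gap of −1.5 already counts as progress even though the ranking is still wrong.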