🤖 AI Summary
DPO, proposed as a lightweight alternative to RLHF, empirically underperforms PPO-RLHF in preference alignment. Through systematic analysis, this work identifies a fundamental flaw in DPO: a severe imbalance in the gradient contributions of preference pairs, which destabilizes optimization trajectories and leads to suboptimal convergence. To address this, we propose Balanced-DPO, a theoretically grounded, implementation-light gradient reweighting method that requires no auxiliary models, additional data, or hyperparameter tuning. Derived from a sensitivity analysis of the DPO objective's gradients, Balanced-DPO integrates seamlessly into standard supervised fine-tuning pipelines. Across multiple LLM preference alignment benchmarks, Balanced-DPO consistently outperforms vanilla DPO (average win rate +3.2%), exhibits improved training stability, and substantially narrows the performance gap with PPO-RLHF.
📝 Abstract
Direct Preference Optimization (DPO) has been proposed as a promising alternative to Reinforcement Learning from Human Feedback (RLHF) based on Proximal Policy Optimization (PPO). However, empirical evaluations consistently show that DPO underperforms common RLHF pipelines. In this work, we conduct a systematic analysis of DPO's training dynamics and identify gradient imbalance as a critical limitation. We demonstrate theoretically and empirically that this imbalance perturbs optimization trajectories, destabilizes learning, and induces suboptimal convergence. To address this issue, we propose Balanced-DPO, a simple yet effective modification to the DPO objective that introduces a computationally efficient gradient reweighting mechanism. Our experiments validate the theoretical findings and confirm that addressing gradient imbalance is key to improving DPO's performance, highlighting a promising direction for future research.
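To make the gradient-imbalance intuition concrete, here is a minimal, dependency-free sketch. The per-pair DPO loss and its margin definition follow the standard DPO objective; the reweighting rule shown (equalizing detached per-pair gradient scales to their batch mean) is an illustrative assumption, not necessarily the paper's exact Balanced-DPO formulation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_pair_losses(margins, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * margin), where
    margin = [log pi(y_w)/pi_ref(y_w)] - [log pi(y_l)/pi_ref(y_l)]."""
    return [-math.log(sigmoid(beta * m)) for m in margins]

def balanced_dpo_losses(margins, beta=0.1):
    """Hypothetical reweighting sketch (assumption, not the paper's exact rule).

    The DPO gradient w.r.t. each pair's margin scales with sigmoid(-beta * margin),
    so pairs with large negative margins dominate the batch gradient. Here each
    pair is reweighted (with weights treated as detached constants) so that the
    effective per-pair gradient scales are equalized to the batch mean.
    """
    grad_scales = [sigmoid(-beta * m) for m in margins]  # |dL/dmargin| / beta
    mean_scale = sum(grad_scales) / len(grad_scales)
    weights = [mean_scale / s for s in grad_scales]      # detached in practice
    return [w * l for w, l in zip(weights, dpo_pair_losses(margins, beta))]
```

With this toy rule, every pair's effective gradient scale (`weight * grad_scale`) equals the batch mean, so no single hard pair can dominate an update step; in a real training loop the weights would be computed without gradient tracking so only the loss term itself is differentiated.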