🤖 AI Summary
DPO, proposed as a lightweight alternative to RLHF, empirically underperforms PPO-RLHF in preference alignment. Through systematic analysis, this work identifies a fundamental flaw in DPO: a severe imbalance in the gradient contributions of preference pairs, which destabilizes optimization trajectories and leads to suboptimal convergence. To address this, we propose Balanced-DPO, a theoretically grounded, implementation-light gradient reweighting method that requires no auxiliary models, additional data, or hyperparameter tuning. Derived from a sensitivity analysis of the DPO objective's gradients, Balanced-DPO integrates seamlessly into standard supervised fine-tuning pipelines. Across multiple LLM preference alignment benchmarks, Balanced-DPO consistently outperforms vanilla DPO (average win rate +3.2%), exhibits improved training stability, and substantially narrows the performance gap with PPO-RLHF.
📝 Abstract
Direct Preference Optimization (DPO) has been proposed as a promising alternative to Reinforcement Learning from Human Feedback (RLHF) based on Proximal Policy Optimization (PPO). However, empirical evaluations consistently show that DPO underperforms common RLHF pipelines. In this work, we conduct a systematic analysis of DPO's training dynamics and identify gradient imbalance as a critical limitation. We demonstrate theoretically and empirically that this imbalance perturbs optimization trajectories, destabilizes learning, and induces suboptimal convergence. To address this issue, we propose Balanced-DPO, a simple yet effective modification to the DPO objective that introduces a computationally efficient gradient reweighting mechanism. Our experiments validate the theoretical findings and confirm that addressing gradient imbalance is key to improving DPO's performance, highlighting a promising direction for future research.
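To make the gradient-imbalance intuition concrete, here is a minimal, dependency-free sketch. The per-pair DPO loss and its margin definition follow the standard DPO objective; the reweighting rule shown (equalizing detached per-pair gradient scales to their batch mean) is an illustrative assumption, not necessarily the paper's exact Balanced-DPO formulation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_pair_losses(margins, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * margin), where
    margin = [log pi(y_w)/pi_ref(y_w)] - [log pi(y_l)/pi_ref(y_l)]."""
    return [-math.log(sigmoid(beta * m)) for m in margins]

def balanced_dpo_losses(margins, beta=0.1):
    """Hypothetical reweighting sketch (assumption, not the paper's exact rule).

    The DPO gradient w.r.t. each pair's margin scales with sigmoid(-beta * margin),
    so pairs with large negative margins dominate the batch gradient. Here each
    pair is reweighted (with weights treated as detached constants) so that the
    effective per-pair gradient scales are equalized to the batch mean.
    """
    grad_scales = [sigmoid(-beta * m) for m in margins]  # |dL/dmargin| / beta
    mean_scale = sum(grad_scales) / len(grad_scales)
    weights = [mean_scale / s for s in grad_scales]      # detached in practice
    return [w * l for w, l in zip(weights, dpo_pair_losses(margins, beta))]
```

With this toy rule, every pair's effective gradient scale (`weight * grad_scale`) equals the batch mean, so no single hard pair can dominate an update step; in a real training loop the weights would be computed without gradient tracking so only the loss term itself is differentiated.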