Lightweight Robust Direct Preference Optimization

πŸ“… 2025-10-27
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Direct Preference Optimization (DPO) is sensitive to noise in preference data, which can lead to overfitting and degraded generalization. Existing distributionally robust optimization (DRO) approaches mitigate this issue, but they often incur high computational cost and yield overly conservative solutions. DPO-PRO is a lightweight, distributionally robust DPO algorithm that models uncertainty only in the preference distribution, avoiding the conservatism of conventional DRO methods. Its robust objective is shown to be equivalent to a regularized DPO objective that penalizes overconfident predictions on weak preference signals, so the added computational overhead is negligible. Experiments on standard alignment benchmarks and a real-world public health task show that DPO-PRO consistently improves robustness to noisy preference signals compared with existing DPO variants.

πŸ“ Abstract
Direct Preference Optimization (DPO) has become a popular method for fine-tuning large language models (LLMs) due to its stability and simplicity. However, it is also known to be sensitive to noise in the data and prone to overfitting. Recent works have proposed using distributionally robust optimization (DRO) to address potential noise and distributional shift in the data. However, these methods often suffer from excessive conservatism and high computational cost. We propose DPO-PRO (DPO with Preference Robustness), a robust fine-tuning algorithm based on DPO which accounts for uncertainty in the preference distribution through a lightweight DRO formulation. Unlike prior DRO-based variants, DPO-PRO focuses solely on uncertainty in preferences, avoiding unnecessary conservatism and incurring negligible computational overhead. We further show that DPO-PRO is equivalent to a regularized DPO objective that penalizes model overconfidence under weak preference signals. We evaluate DPO-PRO on standard alignment benchmarks and a real-world public health task. Experimental results show that our method consistently improves robustness to noisy preference signals compared to existing DPO variants.
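The equivalence the abstract describes (robustness reformulated as a penalty on overconfidence under weak preference signals) can be sketched numerically. The `dpo_loss` function below is the standard published DPO objective; the `dpo_pro_loss` regularizer, the `pref_strength` input (e.g. an annotator-agreement rate), and the weight `lam` are illustrative assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dpo_loss(logratio_w, logratio_l, beta=0.1):
    # Standard DPO: negative log-sigmoid of the scaled preference margin,
    # where logratio_* = log pi_theta(y|x) - log pi_ref(y|x) for the
    # preferred (w) and dispreferred (l) responses.
    margin = beta * (logratio_w - logratio_l)
    return -np.log(sigmoid(margin))

def dpo_pro_loss(logratio_w, logratio_l, pref_strength, beta=0.1, lam=0.5):
    # Hypothetical sketch of a DPO-PRO-style regularizer: penalize large
    # confident margins when the preference label is weak. `pref_strength`
    # in [0.5, 1.0] stands in for how decisive the preference signal is;
    # `weakness` is 1 for a 50/50 label and 0 for a unanimous one.
    margin = beta * (logratio_w - logratio_l)
    weakness = 1.0 - 2.0 * abs(pref_strength - 0.5)
    return -np.log(sigmoid(margin)) + lam * weakness * margin ** 2
```

With a unanimous label (`pref_strength = 1.0`) the penalty vanishes and the objective reduces to standard DPO; at a 50/50 label the quadratic term discourages the model from placing a large confident margin on an essentially uninformative comparison.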
Problem

Research questions and friction points this paper is trying to address.

- Addresses DPO's sensitivity to noisy preference data
- Reduces the excessive conservatism of prior robust optimization methods
- Keeps computational overhead minimal while improving model robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

- Lightweight DRO formulation targeting uncertainty in preferences only
- Equivalent regularized objective that penalizes model overconfidence under weak preference signals
- Negligible computational overhead compared with prior DRO-based methods
πŸ”Ž Similar Papers
No similar papers found.