InCo-DPO: Balancing Distribution Shift and Data Quality for Enhanced Preference Optimization

📅 2025-03-20
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This work addresses the fundamental trade-off between distribution shift and data quality in Direct Preference Optimization (DPO). The authors propose InCo-DPO, a dynamic collaborative optimization framework that jointly leverages on-policy and off-policy data. Its core contribution is a dual-objective balancing mechanism that optimizes simultaneously for distribution consistency and data quality, implemented via learnable sample-confidence weighting and a distribution-correction module, enabling the safe integration of heterogeneous off-policy data. By moving beyond the prevailing reliance on on-policy data alone, InCo-DPO achieves a 60.8% win rate on the Arena-Hard benchmark with the Gemma-2 model, outperforming both pure on-policy and pure off-policy baselines and setting a new state of the art.
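
For context, InCo-DPO trains with the vanilla DPO objective and changes how preference pairs are built, not the loss itself. The standard DPO formulation (from the original DPO literature, not specific to this paper) is:

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
  \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right) \right]
```

Here y_w and y_l are the preferred and dispreferred responses, \pi_{\mathrm{ref}} is a frozen reference policy, and \beta controls the strength of the regularization toward the reference; InCo-DPO's contribution is the construction of the pairs (y_w, y_l) from mixed on-policy and off-policy candidates.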

📝 Abstract
Direct Preference Optimization (DPO) optimizes language models to align with human preferences. Utilizing on-policy samples, generated directly by the policy model, typically results in better performance than using off-policy samples due to their distribution consistency with the model. This paper identifies the quality of candidate preference samples as another critical factor. While the quality of on-policy data is inherently constrained by the capabilities of the policy model, off-policy data, which can be derived from diverse sources, offers greater potential for quality despite suffering from distribution shift. However, current research mostly relies on on-policy data and, because of the challenge posed by distribution shift, neglects the value off-policy data offers in terms of quality. In this paper, we propose InCo-DPO, an efficient method for synthesizing preference data by integrating on-policy and off-policy data, allowing dynamic adjustments that balance distribution shift against data quality and thus find an optimal trade-off. Consequently, InCo-DPO overcomes both the distribution shift of off-policy data and the quality constraints of on-policy data. We evaluate InCo-DPO on the Alpaca-Eval 2.0 and Arena-Hard benchmarks. Experimental results demonstrate that our approach not only outperforms both purely on-policy and purely off-policy data but also achieves a state-of-the-art win rate of 60.8 on Arena-Hard with vanilla DPO using the Gemma-2 model.
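
To make the abstract's mechanism concrete, here is a minimal, hypothetical sketch of the pair-synthesis step: pool on-policy and off-policy candidates for each prompt, score each candidate for quality and for distribution consistency, and select a (chosen, rejected) pair under a tunable trade-off. Every name here (Candidate, combined_score, lam) and the linear scoring rule are illustrative assumptions, not the paper's actual algorithm.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    source: str         # "on-policy" (sampled from the policy) or "off-policy"
    quality: float      # e.g., a reward-model score, normalized to [0, 1]
    consistency: float  # e.g., mean policy log-prob, normalized to [0, 1]

def combined_score(c: Candidate, lam: float) -> float:
    """Trade data quality off against distribution consistency.

    lam = 0.0 ranks purely by quality (favors strong off-policy data);
    lam = 1.0 ranks purely by consistency (favors on-policy data).
    """
    return (1.0 - lam) * c.quality + lam * c.consistency

def synthesize_pair(pool: list[Candidate], lam: float) -> tuple[Candidate, Candidate]:
    """Pick (chosen, rejected) from a mixed on-/off-policy candidate pool."""
    ranked = sorted(pool, key=lambda c: combined_score(c, lam), reverse=True)
    return ranked[0], ranked[-1]

# Toy pool for one prompt (scores are made up for illustration).
pool = [
    Candidate("on-policy draft", "on-policy", quality=0.55, consistency=0.90),
    Candidate("strong external answer", "off-policy", quality=0.90, consistency=0.40),
    Candidate("weak external answer", "off-policy", quality=0.20, consistency=0.30),
]
chosen, rejected = synthesize_pair(pool, lam=0.3)
print(chosen.source, "chosen over", rejected.source)
# At lam=0.3 the strong off-policy answer wins the chosen slot; raise lam
# past ~0.41 and the on-policy draft wins instead, trading quality for
# distribution consistency.
```

With a small lam, a high-quality off-policy answer can be selected despite its lower policy likelihood; raising lam shifts selection back toward on-policy samples. This single knob stands in for the distribution-shift-versus-quality trade-off that the paper adjusts dynamically.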
Problem

Research questions and friction points this paper is trying to address.

Balancing distribution shift and data quality in preference optimization.
Integrating on-policy and off-policy data for enhanced model performance.
Overcoming the distribution shift of off-policy data and the quality ceiling of on-policy data in DPO.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates on-policy and off-policy preference data in one synthesis pipeline
Dynamically balances distribution shift against data quality
Achieves a state-of-the-art 60.8 win rate on Arena-Hard
Yunan Wang
Beihang University
Jijie Li
Beijing Academy of Artificial Intelligence
Bo-Wen Zhang
Beijing Academy of Artificial Intelligence
Liangdong Wang
Beijing Academy of Artificial Intelligence
Guang Liu
Beijing Academy of Artificial Intelligence
AI, LLM, Data