Omni-DPO: A Dual-Perspective Paradigm for Dynamic Preference Learning of LLMs

📅 2025-06-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing DPO methods treat all preference pairs uniformly, ignoring their intrinsic quality differences and the model's evolving learning dynamics, which leads to inefficient data utilization. To address this, the authors propose a dual-perspective optimization framework that introduces a dynamic dual-weighting mechanism—jointly modeling both the inherent quality of each preference pair and the model's evolving performance on it—to enable adaptive sample reweighting during training. The method integrates a quality assessment module and a learning-dynamics awareness module, requiring no auxiliary reward model or reinforcement learning components. Gemma-2-9b-it fine-tuned with the method outperforms Claude 3 Opus by 6.7 points on the Arena-Hard benchmark and consistently beats baselines on mathematical reasoning tasks, demonstrating strong generalization and robustness. Key contributions: (1) a dynamic dual-weighting mechanism; (2) a joint quality–dynamics modeling paradigm; and (3) an efficient DPO enhancement framework that needs no separate RLHF components.
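The dual-weighting idea described above can be sketched as a per-sample reweighting of the standard DPO loss: each pair's loss is scaled by a fixed data-quality weight and a dynamic weight that shrinks as the model learns to separate that pair. The sketch below is illustrative only—the function name, the `gamma` parameter, and the sigmoid-based dynamic weight are assumptions, not the paper's exact formulation.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dual_weighted_dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected,
                           quality_w, beta=0.1, gamma=1.0):
    """Illustrative dual-weighted DPO loss over a batch of preference pairs.

    Each argument is a list of per-sample log-probabilities (policy and
    reference model, chosen and rejected responses) plus a per-sample
    quality weight. The dynamic weight down-weights pairs the policy
    already separates well, focusing training on harder pairs.
    """
    losses = []
    for pc, pr, rc, rr, qw in zip(pi_chosen, pi_rejected,
                                  ref_chosen, ref_rejected, quality_w):
        # Standard DPO logit: implicit reward margin relative to reference.
        logit = beta * ((pc - pr) - (rc - rr))
        base_loss = -math.log(sigmoid(logit))
        # Dynamic (learning-dynamics) weight: near 0.5 for unseparated
        # pairs, decaying toward 0 as the margin grows.
        dyn_w = sigmoid(-gamma * logit)
        losses.append(qw * dyn_w * base_loss)
    return sum(losses) / len(losses)
```

With equal quality weights, a pair the policy already separates (large chosen-vs-rejected margin) contributes less to the average than an unseparated pair, which is the intended focusing behavior.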

📝 Abstract
Direct Preference Optimization (DPO) has become a cornerstone of reinforcement learning from human feedback (RLHF) due to its simplicity and efficiency. However, existing DPO-based approaches typically treat all preference pairs uniformly, ignoring critical variations in their inherent quality and learning utility, leading to suboptimal data utilization and performance. To address this challenge, we propose Omni-DPO, a dual-perspective optimization framework that jointly accounts for (1) the inherent quality of each preference pair and (2) the model's evolving performance on those pairs. By adaptively weighting samples according to both data quality and the model's learning dynamics during training, Omni-DPO enables more effective training data utilization and achieves better performance. Experimental results on various models and benchmarks demonstrate the superiority and generalization capabilities of Omni-DPO. On textual understanding tasks, Gemma-2-9b-it finetuned with Omni-DPO beats the leading LLM, Claude 3 Opus, by a significant margin of 6.7 points on the Arena-Hard benchmark. On mathematical reasoning tasks, Omni-DPO consistently outperforms the baseline methods across all benchmarks, providing strong empirical evidence for the effectiveness and robustness of our approach. Code and models will be available at https://github.com/pspdada/Omni-DPO.
Problem

Research questions and friction points this paper is trying to address.

Uniform treatment of preference pairs in standard DPO ignores differences in data quality
Suboptimal data utilization when all pairs receive equal weight during training
Static objectives do not adapt to the model's evolving performance on individual pairs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-perspective optimization framework jointly modeling pair quality and learning dynamics
Adaptive sample weighting based on the inherent quality of each preference pair
Dynamic reweighting driven by the model's evolving performance during training