Intelligently Weighting Multiple Reference Models for Direct Preference Optimization of LLMs

📅 2025-12-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multi-reference preference optimization (MRPO) methods lack statistical grounding in how they assign reference model weights, leading to unreliable DPO alignment performance. To address this, we propose four novel reference weighting strategies: two offline methods leveraging held-out validation signal, one online method based on a sliding-window estimator, and one online adaptive method that casts reference weighting as a multi-armed bandit solved via Thompson Sampling. Crucially, we provide systematic empirical evidence that single-reference DPO often outperforms multi-reference variants, challenging the assumed necessity of multiple references. Experiments with a Qwen2.5-0.5B policy model paired with seven reference models (0.5B to 14B parameters) show that all four proposed strategies outperform existing MRPO weighting methods in preference accuracy on the UltraFeedback and SafeRLHF benchmarks. Our work establishes a more reliable and interpretable paradigm for reference weight modeling in LLM alignment.


📝 Abstract
Fine-tuning is integral for aligning large language models (LLMs) with human preferences. Multiple-Reference Preference Optimization (MRPO) builds on Direct Preference Optimization (DPO) by fine-tuning LLMs on preference datasets while regularizing the policy towards a mixture of reference models to leverage their collective desirable properties. However, current methods for setting the reference weights are ad-hoc and statistically unsound, leading to unreliable performance. To address this, we introduce four new weighting strategies: two offline methods that leverage held-out validation signal; one online method that uses a sliding-window estimator to reduce overfitting; and one online method that treats reference weighting as a $K$-armed bandit via Thompson Sampling. Experiments using Qwen2.5-0.5B as the policy model and seven reference models from the Llama, Mistral, Qwen, Yi, and Phi families (0.5B to 14B parameters) show that all four of our strategies outperform current MRPO weighting methods on UltraFeedback and SafeRLHF in preference accuracy. More thought-provokingly, however, we find that single-reference DPO, using any of 6 out of 7 references, consistently outperforms all tested multiple-reference approaches -- calling into question the practical appeal of multiple-reference approaches.
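For context, the standard DPO objective regularizes the policy $\pi_\theta$ toward a single reference $\pi_{\mathrm{ref}}$:

$$\mathcal{L}_{\mathrm{DPO}} = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$

One common way a weighted mixture of $K$ references can enter this objective is as a geometric mixture in log space (the paper's exact MRPO formulation may differ):

$$\log \pi_{\mathrm{ref}}(y \mid x) = \sum_{k=1}^{K} w_k \log \pi_k(y \mid x), \qquad \sum_{k=1}^{K} w_k = 1, \quad w_k \ge 0$$

The weighting strategies proposed here are methods for choosing the $w_k$, either offline from validation signal or online during training.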
Problem

Research questions and friction points this paper is trying to address.

Develops weighting strategies for multiple reference models in DPO
Addresses ad-hoc and unreliable weight setting in MRPO methods
Evaluates single versus multiple reference model effectiveness in fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Offline weighting using validation signal
Online sliding-window estimator reducing overfitting
Thompson Sampling as K-armed bandit for weighting
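The bandit-based strategy above can be sketched as a Beta-Bernoulli Thompson Sampling loop, where each reference model is an arm and the reward signals whether a training step regularized toward that reference helped (e.g. improved held-out preference accuracy). This is a minimal illustrative sketch, not the paper's implementation; the class name, reward definition, and weight normalization are assumptions.

```python
import random


class ThompsonReferenceWeighter:
    """Treat each of K reference models as a bandit arm with a
    Beta-Bernoulli posterior over its (assumed binary) reward,
    e.g. whether a DPO step using that reference improved
    held-out preference accuracy."""

    def __init__(self, num_references, seed=None):
        # Beta(1, 1) = uniform prior over each arm's success rate.
        self.alpha = [1.0] * num_references  # posterior successes + 1
        self.beta = [1.0] * num_references   # posterior failures + 1
        self.rng = random.Random(seed)

    def select(self):
        """Thompson step: sample a success rate from each arm's
        posterior and pick the arm with the highest sample."""
        samples = [self.rng.betavariate(a, b)
                   for a, b in zip(self.alpha, self.beta)]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, reward):
        """reward = 1 if the step using this reference helped, else 0."""
        self.alpha[arm] += reward
        self.beta[arm] += 1 - reward

    def weights(self):
        """Normalized posterior-mean success rates, usable as
        reference weights w_k (an illustrative choice)."""
        means = [a / (a + b) for a, b in zip(self.alpha, self.beta)]
        total = sum(means)
        return [m / total for m in means]
```

In use, `select()` picks which reference to regularize toward for the next batch of updates, `update()` folds in the observed outcome, and `weights()` exposes a normalized weight vector over references as the posterior concentrates on the helpful ones.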