🤖 AI Summary
Existing multi-reference preference optimization (MRPO) methods lack theoretical grounding and statistical robustness in assigning reference model weights, leading to unstable DPO alignment performance. To address this, we propose four novel reference weighting strategies: two offline methods leveraging validation signals, one sliding-window-based online estimation method, and one online adaptive method integrating Thompson sampling within a multi-armed bandit framework. Crucially, we provide the first systematic empirical evidence that single-reference DPO often outperforms multi-reference variants, challenging the assumed necessity of multiple references. Extensive experiments on the Qwen2.5-0.5B policy model paired with seven reference models (ranging from 0.5B to 14B parameters) demonstrate that all four proposed strategies significantly surpass state-of-the-art MRPO approaches, achieving substantial gains in preference accuracy on the UltraFeedback and SafeRLHF benchmarks. Our work establishes a more reliable, interpretable, and theoretically grounded paradigm for reference weight modeling in LLM alignment.
📝 Abstract
Fine-tuning is integral to aligning large language models (LLMs) with human preferences. Multiple-Reference Preference Optimization (MRPO) builds on Direct Preference Optimization (DPO) by fine-tuning LLMs on preference datasets while regularizing the policy towards a mixture of reference models to leverage their collective desirable properties. However, current methods for setting the reference weights are ad hoc and statistically unsound, leading to unreliable performance. To address this, we introduce four new weighting strategies: two offline methods that leverage held-out validation signal; one online method that uses a sliding-window estimator to reduce overfitting; and one online method that treats reference weighting as a $K$-armed bandit via Thompson Sampling. Experiments using Qwen2.5-0.5B as the policy model and seven reference models from the Llama, Mistral, Qwen, Yi, and Phi families (0.5B-14B) show that all four of our strategies outperform current MRPO weighting methods on UltraFeedback and SafeRLHF in preference accuracy. More strikingly, however, we find that single-reference DPO with any of six of the seven references consistently outperforms all tested multiple-reference approaches -- calling into question the practical appeal of using multiple references.
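The bandit-based strategy above can be illustrated with a minimal sketch. The abstract specifies only that reference weighting is treated as a $K$-armed bandit solved via Thompson Sampling; the Beta-Bernoulli posterior, the class name, and the binary reward signal (e.g., whether regularizing toward the sampled reference improved held-out preference accuracy) are illustrative assumptions, not the paper's actual implementation:

```python
import random


class ReferenceBandit:
    """Hypothetical Beta-Bernoulli Thompson Sampling over K reference models.

    Each arm is one reference model. A 'success' is an assumed binary
    signal, e.g. whether a DPO update regularized toward that reference
    improved held-out preference accuracy.
    """

    def __init__(self, num_refs: int):
        # Beta(1, 1) uniform prior over each reference's success rate.
        self.alpha = [1.0] * num_refs  # prior + observed successes
        self.beta = [1.0] * num_refs   # prior + observed failures

    def select(self) -> int:
        # Draw one plausible success rate per reference from its
        # posterior, then pick the arm with the largest sample.
        samples = [random.betavariate(a, b)
                   for a, b in zip(self.alpha, self.beta)]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm: int, success: bool) -> None:
        # Conjugate posterior update for the chosen reference.
        if success:
            self.alpha[arm] += 1.0
        else:
            self.beta[arm] += 1.0
```

Because exploration comes from posterior sampling rather than a tuned schedule, the bandit concentrates weight on the most useful reference as evidence accumulates while still occasionally probing the others.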