🤖 AI Summary
In DPO, initializing the policy and reference models identically can lead to inefficient data utilization and a performance ceiling, while SimPO's removal of the reference model reduces training robustness and requires stricter conditions to prevent catastrophic forgetting. The paper observes that the reference model in DPO acts as a data weight adjuster over preference samples, and proposes Pre-DPO, a DPO-based training paradigm that uses a guiding reference model. The guiding reference model provides foresight into the optimal policy state achievable with the training preference data, adaptively assigning higher weights to samples more suitable for the model and lower weights to those less suitable. Experiments on AlpacaEval 2.0 and Arena-Hard v0.1 show that Pre-DPO consistently improves on both DPO and SimPO without relying on external models or additional data.
📝 Abstract
Direct Preference Optimization (DPO) simplifies reinforcement learning from human feedback (RLHF) for large language models (LLMs) by directly optimizing human preferences without an explicit reward model. We find that during DPO training, the reference model plays the role of a data weight adjuster. However, the common practice of initializing the policy and reference models identically in DPO can lead to inefficient data utilization and impose a performance ceiling. Meanwhile, the lack of a reference model in Simple Preference Optimization (SimPO) reduces training robustness and necessitates stricter conditions to prevent catastrophic forgetting. In this work, we propose Pre-DPO, a simple yet effective DPO-based training paradigm that enhances preference optimization performance by leveraging a guiding reference model. This reference model provides foresight into the optimal policy state achievable through the training preference data, serving as a guiding mechanism that adaptively assigns higher weights to samples more suitable for the model and lower weights to those less suitable. Extensive experiments on AlpacaEval 2.0 and Arena-Hard v0.1 benchmarks demonstrate that Pre-DPO consistently improves the performance of both DPO and SimPO, without relying on external models or additional data.
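The "data weight adjuster" reading can be illustrated with the per-sample coefficient that appears in the gradient of the standard DPO loss, sigmoid(beta * [(log pi(y_l|x) - log pi_ref(y_l|x)) - (log pi(y_w|x) - log pi_ref(y_w|x))]). The sketch below is a minimal illustration under stated assumptions, not the paper's implementation: the function name `dpo_sample_weight`, the log-probability values, and `beta=0.1` are all hypothetical, and the abstract does not specify how the guiding reference model is obtained.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function; adequate for the small magnitudes used here."""
    return 1.0 / (1.0 + math.exp(-z))


def dpo_sample_weight(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value, for illustration only
) -> float:
    """Per-sample coefficient in the standard DPO gradient.

    Implicit reward of a response: beta * (log pi(y|x) - log pi_ref(y|x)).
    The weight is sigmoid(rejected_reward - chosen_reward), so it is large
    when the implicit reward model currently mis-ranks the pair.
    """
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    return sigmoid(rejected_reward - chosen_reward)


# Hypothetical sequence log-probabilities for one preference pair.
policy_chosen, policy_rejected = -12.0, -11.0

# Reference identical to the policy: both implicit rewards are zero,
# so the weight is sigmoid(0) regardless of the sample.
w_identical = dpo_sample_weight(policy_chosen, policy_rejected,
                                policy_chosen, policy_rejected)

# A different reference that already favors the chosen response
# raises the weight of this pair ...
w_guided_fit = dpo_sample_weight(policy_chosen, policy_rejected,
                                 ref_logp_chosen=-9.0, ref_logp_rejected=-14.0)

# ... while a reference that favors the rejected response lowers it.
w_guided_misfit = dpo_sample_weight(policy_chosen, policy_rejected,
                                    ref_logp_chosen=-14.0, ref_logp_rejected=-9.0)

print(w_identical, w_guided_fit, w_guided_misfit)
```

With a reference identical to the policy, every pair starts with the same weight of 0.5, which is one way to read the abstract's point about inefficient data utilization at initialization. A reference that differs from the initial policy breaks this uniformity, giving some pairs more weight and others less.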