🤖 AI Summary
Existing DPO variants lack rigorous attribution analysis and fair comparative evaluation of their improvement components, hindering identification of genuinely effective technical pathways. Method: We propose the first unified preference optimization framework that systematically decomposes mainstream DPO enhancements into seven orthogonal dimensions—temperature scaling, reward normalization, dynamic margin, symmetric loss, top-k sampling, gradient reweighting, and multi-turn feedback modeling—and integrates them via a unified objective function to enable synergistic interaction. Contribution/Results: Our framework enables modular composition and quantitative attribution analysis for the first time. It substantially outperforms DPO, IPOL, KTO, and other baselines across multiple benchmarks, validating the efficacy of integrated strategies. Furthermore, we open-source a reusable implementation and practical guidelines to advance standardization and reproducibility in preference optimization research.
📝 Abstract
Recently, numerous preference optimization algorithms have been introduced as extensions to the Direct Preference Optimization (DPO) family. While these methods have successfully aligned models with human preferences, there is a lack of understanding regarding the contributions of their additional components. Moreover, fair and consistent comparisons are scarce, making it difficult to discern which components genuinely enhance downstream performance. In this work, we propose RainbowPO, a unified framework that demystifies the effectiveness of existing DPO methods by categorizing their key components into seven broad directions. We integrate these components into a single cohesive objective, enhancing the performance of each individual element. Through extensive experiments, we demonstrate that RainbowPO outperforms existing DPO variants. Additionally, we provide insights to guide researchers in developing new DPO methods and assist practitioners in their implementations.