🤖 AI Summary
This work addresses the challenge of jointly tuning the β and γ hyperparameters in SimPO, which stems from the lack of interpretability in its margin formulation under varying reward gap structures. By reformulating the preference optimization objective through an equivalent transformation, the method minimizes the distance between the reward gap and an optimal margin, while introducing a ratio-based reward between chosen and rejected responses to eliminate dependence on β. This yields a bounded and interpretable ratio reward margin ξ that explicitly quantifies the desired degree of relative separation and can be derived directly from the initial reward gap distribution, thereby avoiding iterative hyperparameter tuning. Integrating preference optimization, ratio-based reward modeling, and reference-model-free alignment, the proposed approach achieves more stable and efficient training with stronger preference alignment across multiple datasets.
📝 Abstract
Reference-free preference optimization has emerged as an efficient alternative to reinforcement learning from human feedback, with Simple Preference Optimization (SimPO) demonstrating strong performance by eliminating the explicit reference model through a simple objective. However, the joint tuning of the hyperparameters $β$ and $γ$ in SimPO remains a central challenge. We argue that this difficulty arises because the margin formulation in SimPO is not easily interpretable across datasets with different reward gap structures. To better understand this issue, we conduct a comprehensive analysis of SimPO and find that $β$ implicitly controls sample filtering, while the effect of $γ$ depends on the reward gap structure of the dataset. Motivated by these observations, we propose $ξ$-DPO: Direct preference optimization via ratio reward margin. We first reformulate the preference objective through an equivalent transformation, changing the optimization target from maximizing the likelihood of reward gaps to minimizing the distance between reward gaps and optimal margins. Then, we redefine the reward as a ratio between the chosen and rejected responses, which effectively cancels the effect of $β$ and yields a bounded and interpretable margin. This margin is called the ratio reward margin and is denoted by $ξ$. Unlike the margin $γ$ in SimPO, $ξ$ explicitly represents the desired relative separation between chosen and rejected responses and can be determined from the initial reward gap distribution, avoiding repeated trial-and-error tuning. ....
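To make the roles of $β$ and $γ$ concrete, the following is a minimal sketch of the standard SimPO objective for a single preference pair, where the reward is the length-normalized log-probability scaled by $β$ and $γ$ is the target margin. It is not the $ξ$-DPO objective, whose exact form is not given in this abstract. The function names `simpo_loss` and `simpo_grad_weight` are hypothetical, and the inputs are assumed to be average per-token log-probabilities already computed by the policy.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def simpo_loss(avg_logp_chosen: float, avg_logp_rejected: float,
               beta: float, gamma: float) -> float:
    """SimPO loss for one pair: -log sigmoid(beta * gap - gamma).

    `gap` is the difference of length-normalized log-probabilities
    between the chosen and the rejected response.
    """
    gap = avg_logp_chosen - avg_logp_rejected
    # -log(sigmoid(z)) equals softplus(-z).
    return softplus(-(beta * gap - gamma))


def simpo_grad_weight(avg_logp_chosen: float, avg_logp_rejected: float,
                      beta: float, gamma: float) -> float:
    """Magnitude of d(loss)/d(gap) = beta * sigmoid(gamma - beta * gap).

    Pairs whose gap is well above gamma / beta get a weight near zero,
    which is one way to read beta as an implicit sample filter.
    """
    gap = avg_logp_chosen - avg_logp_rejected
    return beta * sigmoid(gamma - beta * gap)
```

In this sketch the weight depends on the pair only through `gamma - beta * gap`, so the gap at which a pair stops contributing is `gamma / beta`. That coupling is consistent with the abstract's point that $β$ and $γ$ are hard to tune independently.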