AI Summary
To address the high computational cost, hyperparameter sensitivity, and reliance on reference models in aligning large language models (LLMs) with human preferences, this paper proposes RePO, a reference-free, temperature-parameter-free offline preference optimization method. Methodologically, RePO introduces a ReLU-based large-margin loss to preference learning for the first time, theoretically establishing it as the limiting form of SimPO as β → ∞. By analyzing gradient dynamics, we design a reference-free margin, leverage ReLU as a convex surrogate for the 0–1 loss, and implicitly achieve binary weighting and gradient clipping. RePO requires tuning only a single hyperparameter. Empirically, it consistently outperforms DPO and SimPO on AlpacaEval 2 and Arena-Hard, demonstrating strong generalization across diverse base models.
Abstract
Aligning large language models (LLMs) with human preferences is critical for real-world deployment, yet existing methods like RLHF face computational and stability challenges. While DPO establishes an offline paradigm with single hyperparameter $\beta$, subsequent methods like SimPO reintroduce complexity through dual parameters ($\beta$, $\gamma$). We propose ReLU-based Preference Optimization (RePO), a streamlined algorithm that eliminates $\beta$ via two advances: (1) retaining SimPO's reference-free margins but removing $\beta$ through gradient analysis, and (2) adopting a ReLU-based max-margin loss that naturally filters trivial pairs. Theoretically, RePO is characterized as SimPO's limiting case ($\beta \to \infty$), where the logistic weighting collapses to binary thresholding, forming a convex envelope of the 0–1 loss. Empirical results on AlpacaEval 2 and Arena-Hard show that RePO outperforms DPO and SimPO across multiple base models, requiring only one hyperparameter to tune.
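The limiting relationship described above can be sketched per preference pair. The following is a minimal illustration, not the authors' implementation: it assumes length-normalized sequence log-probabilities for the chosen (`logp_w`) and rejected (`logp_l`) responses, SimPO's per-pair loss $-\log\sigma(\beta(m - \gamma))$ over the margin $m$, and the ReLU max-margin loss $\max(0, \gamma - m)$ as its $\beta \to \infty$ limit (after rescaling by $1/\beta$). Pairs whose margin already exceeds $\gamma$ contribute zero loss and zero gradient, which is the "filters trivial pairs" behavior.

```python
import math

def simpo_pair_loss(logp_w, logp_l, len_w, len_l, beta, gamma):
    """SimPO-style per-pair loss: -log sigmoid(beta * (margin - gamma)),
    with the margin built from length-normalized log-probs (reference-free)."""
    margin = logp_w / len_w - logp_l / len_l
    # log1p(exp(-x)) is a numerically stable -log sigmoid(x)
    return math.log1p(math.exp(-beta * (margin - gamma)))

def relu_pair_loss(logp_w, logp_l, len_w, len_l, gamma):
    """ReLU max-margin per-pair loss: max(0, gamma - margin).
    Only one hyperparameter (gamma); pairs separated by at least
    gamma are filtered out (zero loss, zero gradient)."""
    margin = logp_w / len_w - logp_l / len_l
    return max(0.0, gamma - margin)

# As beta grows, SimPO's loss (scaled by 1/beta) approaches the ReLU loss:
# the logistic weighting collapses to a binary threshold at margin == gamma.
```

Here the 1/β rescaling only changes the effective learning rate, so the gradient direction of the two losses coincides in the limit.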