RePO: ReLU-based Preference Optimization

📅 2025-03-10
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address the high computational cost, hyperparameter sensitivity, and reliance on reference models in aligning large language models (LLMs) with human preferences, this paper proposes RePO, a reference-free offline preference optimization method with no temperature parameter. Methodologically, RePO introduces a ReLU-based max-margin loss to preference learning for the first time, and is theoretically established as the limiting form of SimPO as β → ∞. Guided by an analysis of gradient dynamics, the authors design a reference-free margin, use ReLU as a convex surrogate of the non-convex 0–1 loss, and implicitly obtain binary weighting and gradient clipping, so that RePO requires tuning only a single hyperparameter. Empirically, it consistently outperforms DPO and SimPO on AlpacaEval 2 and Arena-Hard, generalizing well across diverse base models.

Technology Category

Application Category

πŸ“ Abstract
Aligning large language models (LLMs) with human preferences is critical for real-world deployment, yet existing methods like RLHF face computational and stability challenges. While DPO establishes an offline paradigm with single hyperparameter $\beta$, subsequent methods like SimPO reintroduce complexity through dual parameters ($\beta$, $\gamma$). We propose ReLU-based Preference Optimization (RePO), a streamlined algorithm that eliminates $\beta$ via two advances: (1) retaining SimPO's reference-free margins but removing $\beta$ through gradient analysis, and (2) adopting a ReLU-based max-margin loss that naturally filters trivial pairs. Theoretically, RePO is characterized as SimPO's limiting case ($\beta \to \infty$), where the logistic weighting collapses to binary thresholding, forming a convex envelope of the 0-1 loss. Empirical results on AlpacaEval 2 and Arena-Hard show that RePO outperforms DPO and SimPO across multiple base models, requiring only one hyperparameter to tune.
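The relationship the abstract describes can be made concrete with a small sketch of the per-pair objectives. Here `margin` stands for SimPO's reference-free, length-normalized log-probability gap between the chosen and rejected responses; the function names and scalar formulation are illustrative, not the paper's implementation:

```python
import math

def simpo_loss(margin, beta, gamma):
    # SimPO per-pair loss: -log sigmoid(beta * (margin - gamma)),
    # where `margin` is the length-normalized log-prob gap between
    # the chosen and rejected responses (reference-free).
    return -math.log(1.0 / (1.0 + math.exp(-beta * (margin - gamma))))

def repo_loss(margin, gamma):
    # RePO per-pair loss: ReLU max-margin, max(0, gamma - margin).
    # Pairs whose margin already exceeds gamma contribute zero loss
    # and zero gradient, so trivial pairs are filtered automatically,
    # and gamma is the only hyperparameter left to tune.
    return max(0.0, gamma - margin)
```

In the β → ∞ limit, SimPO's logistic weighting collapses to binary thresholding: `simpo_loss(m, beta, gamma) / beta` approaches `repo_loss(m, gamma)` as `beta` grows, which is the limiting-case characterization stated above.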
Problem

Research questions and friction points this paper is trying to address.

Aligning LLMs with human preferences efficiently
Reducing complexity in preference optimization methods
Improving computational stability and performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

ReLU-based max-margin loss filters trivial pairs
Eliminates hyperparameter β via gradient analysis
Streamlined algorithm outperforms DPO and SimPO
🔎 Similar Papers
No similar papers found.