AI Summary
To address the high computational cost, hyperparameter sensitivity, and reliance on reference models in aligning large language models (LLMs) with human preferences, this paper proposes RePO, a reference-free, temperature-parameter-free offline preference optimization method. Methodologically, RePO introduces a ReLU-based large-margin loss to preference learning for the first time, theoretically establishing it as the limiting form of SimPO as β → ∞. By analyzing gradient dynamics, we design a reference-free margin, leverage ReLU as a convex surrogate for the 0–1 loss, and implicitly achieve binary weighting and gradient clipping. RePO requires tuning only a single hyperparameter. Empirically, it consistently outperforms DPO and SimPO on AlpacaEval 2 and Arena-Hard, demonstrating strong generalization across diverse base models.
Abstract
Aligning large language models (LLMs) with human preferences is critical for real-world deployment, yet existing methods like RLHF face computational and stability challenges. While DPO establishes an offline paradigm with single hyperparameter $\beta$, subsequent methods like SimPO reintroduce complexity through dual parameters ($\beta$, $\gamma$). We propose ReLU-based Preference Optimization (RePO), a streamlined algorithm that eliminates $\beta$ via two advances: (1) retaining SimPO's reference-free margins but removing $\beta$ through gradient analysis, and (2) adopting a ReLU-based max-margin loss that naturally filters trivial pairs. Theoretically, RePO is characterized as SimPO's limiting case ($\beta \to \infty$), where the logistic weighting collapses to binary thresholding, forming a convex envelope of the 0–1 loss. Empirical results on AlpacaEval 2 and Arena-Hard show that RePO outperforms DPO and SimPO across multiple base models, requiring only one hyperparameter to tune.
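The limiting relationship described above can be sketched per preference pair. The following is a minimal illustration, not the authors' implementation: it assumes length-normalized sequence log-probabilities for the chosen (`logp_w`) and rejected (`logp_l`) responses, SimPO's per-pair loss $-\log\sigma(\beta(m - \gamma))$ over the margin $m$, and the ReLU max-margin loss $\max(0, \gamma - m)$ as its $\beta \to \infty$ limit (after rescaling by $1/\beta$). Pairs whose margin already exceeds $\gamma$ contribute zero loss and zero gradient, which is the "filters trivial pairs" behavior.

```python
import math

def simpo_pair_loss(logp_w, logp_l, len_w, len_l, beta, gamma):
    """SimPO-style per-pair loss: -log sigmoid(beta * (margin - gamma)),
    with the margin built from length-normalized log-probs (reference-free)."""
    margin = logp_w / len_w - logp_l / len_l
    # log1p(exp(-x)) is a numerically stable -log sigmoid(x)
    return math.log1p(math.exp(-beta * (margin - gamma)))

def relu_pair_loss(logp_w, logp_l, len_w, len_l, gamma):
    """ReLU max-margin per-pair loss: max(0, gamma - margin).
    Only one hyperparameter (gamma); pairs separated by at least
    gamma are filtered out (zero loss, zero gradient)."""
    margin = logp_w / len_w - logp_l / len_l
    return max(0.0, gamma - margin)

# As beta grows, SimPO's loss (scaled by 1/beta) approaches the ReLU loss:
# the logistic weighting collapses to a binary threshold at margin == gamma.
```

Here the 1/β rescaling only changes the effective learning rate, so the gradient direction of the two losses coincides in the limit.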