Explicit Preference Optimization: No Need for an Implicit Reward Model

📅 2025-06-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing implicit reward modeling paradigms—such as Direct Preference Optimization (DPO)—suffer from suboptimal regularization and counterintuitive interpolation behavior. To address these limitations, we propose Explicit Preference Optimization (EXPO), the first fully explicit preference alignment framework that abandons reparameterization-induced implicit reward modeling. EXPO is grounded in the Bradley–Terry model, yielding an interpretable and verifiable regularization objective. We provide theoretical guarantees that EXPO satisfies ideal regularization properties—thereby eliminating the fundamental flaws inherent to DPO-style methods. Empirically, EXPO consistently outperforms DPO and its variants across multiple benchmarks, demonstrating superior training stability, higher preference consistency, and stronger generalization capability.
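The summary states that EXPO is grounded in the Bradley–Terry model. For reference, the Bradley–Terry model expresses the probability that one response is preferred over another as a logistic function of their reward difference. A minimal sketch (the function name is ours; the formulation itself is standard):

```python
import math

def bradley_terry_prob(reward_chosen: float, reward_rejected: float) -> float:
    """P(chosen preferred over rejected) under the Bradley-Terry model:
    sigma(r_w - r_l), where sigma is the logistic function."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))
```

Equal rewards yield a 50/50 preference; a larger reward margin pushes the probability toward 1.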

📝 Abstract
The generated responses of large language models (LLMs) are often fine-tuned to human preferences through a process called reinforcement learning from human feedback (RLHF). As RLHF relies on a challenging training sequence, whereby a separate reward model is independently learned and then later applied to LLM policy updates, ongoing research effort has targeted more straightforward alternatives. In this regard, direct preference optimization (DPO) and its many offshoots circumvent the need for a separate reward training step. Instead, through the judicious use of a reparameterization trick that induces an implicit reward, DPO and related methods consolidate learning to the minimization of a single loss function. And yet despite demonstrable success in some real-world settings, we prove that DPO-based objectives are nonetheless subject to sub-optimal regularization and counter-intuitive interpolation behaviors, underappreciated artifacts of the reparameterizations upon which they are based. To address these issues, we introduce an explicit preference optimization framework termed EXPO that requires no analogous reparameterization to achieve an implicit reward. Quite differently, we merely posit intuitively-appealing regularization factors from scratch that transparently avoid the potential pitfalls of key DPO variants, provably satisfying regularization desiderata that prior methods do not. Empirical results serve to corroborate our analyses and showcase the efficacy of EXPO.
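The abstract describes how DPO consolidates learning into a single loss by reparameterizing the reward as a function of policy and reference log-probabilities. A minimal per-example sketch of the standard DPO objective the abstract refers to (scalar log-probabilities stand in for summed token log-likelihoods; `beta` is DPO's usual KL-strength hyperparameter):

```python
import math

def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss for a (chosen, rejected) response pair.

    The reparameterization trick defines an implicit reward
    r(x, y) = beta * log(pi(y|x) / pi_ref(y|x)), then plugs the
    reward margin into a Bradley-Terry negative log-likelihood.
    """
    reward_w = beta * (logp_w - ref_logp_w)  # implicit reward, chosen
    reward_l = beta * (logp_l - ref_logp_l)  # implicit reward, rejected
    margin = reward_w - reward_l
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference, the margin is zero and the loss equals log 2; shifting probability mass toward the chosen response lowers the loss. It is exactly this implicit-reward construction that the paper's EXPO framework dispenses with.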
Problem

Research questions and friction points this paper is trying to address.

Optimizing LLM responses without implicit reward models
Addressing sub-optimal regularization in DPO-based methods
Introducing EXPO for transparent preference optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

EXPO avoids implicit reward reparameterization
Introduces intuitive regularization factors
Transparently avoids DPO pitfalls