🤖 AI Summary
This paper establishes a rigorous theoretical foundation for Direct Preference Optimization (DPO) and clarifies its role within preference learning. Methodologically, it unifies Savage’s loss theory with the Doignon–Falmagne and Machina stochastic choice models in a single framework, introducing margin and length-correction techniques to systematically characterize the general relationship between loss functions and stochastic choice behavior. The contributions are threefold: (1) It provides principled interpretations of DPO and its variants, exposing the implicit assumptions and limitations of existing approaches; (2) It naturally yields novel DPO formulations supporting abstention, non-convex objectives, and multi-alternative settings; (3) It constructs a scalable theoretical interface that enables the design of robust and flexible preference optimization algorithms. By bridging loss theory and stochastic choice, this work advances both the theoretical understanding of, and algorithmic innovation in, preference learning.
📝 Abstract
In this paper, we show that direct preference optimization (DPO) is a very specific form of a connection between two major theories in the ML context of learning from preferences: loss functions (Savage) and stochastic choice (Doignon–Falmagne and Machina). The connection is established for all of Savage's losses, and at this level of generality, (i) it includes support for abstention on the choice theory side, (ii) it includes support for non-convex objectives on the ML side, and (iii) it frames, for free, some notable extensions of the DPO setting, including margins and corrections for length. Understanding how DPO operates from a general principled perspective is crucial because of the huge and diverse application landscape of models and the current momentum around DPO, but also -- and importantly -- because many state-of-the-art variations on DPO occupy only a small region of the map that we cover. It also helps to understand the pitfalls of departing from this map, and to figure out workarounds.
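For orientation, the specific DPO objective that the paper generalizes, together with the margin and length-correction extensions the abstract mentions, can be sketched numerically. The function below is an illustrative reconstruction of the standard pairwise DPO loss (logistic loss on implicit reward differences), not the paper's generalized family; the parameter names (`beta`, `margin`, `length_normalize`) are ours.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
             beta=0.1, margin=0.0,
             len_w=1, len_l=1, length_normalize=False):
    """Pairwise DPO-style loss on one (chosen, rejected) response pair.

    logp_w, logp_l:         policy log-likelihoods of chosen / rejected responses
    ref_logp_w, ref_logp_l: reference-model log-likelihoods of the same responses
    margin:                 optional target gap on the reward difference
                            (as in margin-based DPO variants)
    length_normalize:       divide log-likelihoods by response length
                            (as in length-corrected variants)
    """
    if length_normalize:
        logp_w, ref_logp_w = logp_w / len_w, ref_logp_w / len_w
        logp_l, ref_logp_l = logp_l / len_l, ref_logp_l / len_l
    # implicit rewards: beta * log-ratio of policy vs. reference
    r_w = beta * (logp_w - ref_logp_w)
    r_l = beta * (logp_l - ref_logp_l)
    # logistic (Bradley-Terry) negative log-likelihood of preferring w over l;
    # a positive margin demands a larger reward gap before the loss vanishes
    return -math.log(sigmoid(r_w - r_l - margin))
```

When the policy agrees with the reference on both responses, the loss is log 2 (chance level); it decreases as the policy shifts mass toward the chosen response relative to the reference, and a positive margin raises the loss for any fixed reward gap.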