Displacement-Resistant Extensions of DPO with Nonconvex $f$-Divergences

📅 2026-02-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two limitations of existing Direct Preference Optimization (DPO)-style algorithms: they assume that the generating function $f$ of the $f$-divergence is convex, and they are susceptible to "probability displacement", in which the probabilities of both the winner and the loser responses approach zero. The paper observes that $f$ need not be convex and introduces a "DPO-inducing" condition that characterizes exactly when the RLHF problem remains tractable. It further identifies a "displacement-resistant" condition on $f$ that is necessary to prevent probability displacement. Building on these two conditions, the authors select a specific $f$ that is both DPO-inducing and displacement-resistant and derive a new loss function, SquaredPO. Compared to DPO, SquaredPO offers stronger theoretical guarantees while performing competitively in practice.

📝 Abstract

DPO and related algorithms align language models by directly optimizing the RLHF objective: find a policy that maximizes the Bradley-Terry reward while staying close to a reference policy through a KL divergence penalty. Previous work showed that this approach could be further generalized: the original problem remains tractable even if the KL divergence is replaced by any $f$-divergence with a convex generating function $f$. Our first contribution is to show that convexity of $f$ is not essential. Instead, we identify a more general condition, referred to as DPO-inducing, that precisely characterizes when the RLHF problem remains tractable. Our next contribution is to establish a second condition on $f$ that is necessary to prevent probability displacement, a known empirical phenomenon in which the probabilities of both the winner and the loser responses approach zero. We refer to any $f$ that satisfies this condition as displacement-resistant. We finally focus on a specific DPO-inducing and displacement-resistant $f$, leading to our novel SquaredPO loss. Compared to DPO, this new loss offers stronger theoretical guarantees while performing competitively in practice.
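
To make the probability displacement phenomenon concrete, here is a minimal sketch of the standard (KL-based) DPO loss for a single preference pair. It is not the SquaredPO loss, whose form is not given in this summary. The function names, the value `beta=0.1`, and the log-probabilities in the example are illustrative assumptions. The sketch shows that the loss depends only on the margin between the winner and loser log-ratios, so it can decrease even when both responses become less likely.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (winner, loser) pair.

    Inputs are sequence log-probabilities under the policy and under the
    reference policy. beta is the regularization strength (assumed value).
    """
    # Implicit reward margin: beta * (winner log-ratio - loser log-ratio).
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)


# Hypothetical numbers: the reference assigns log-probability -10 to both.
loss_at_reference = dpo_loss(-10.0, -10.0, -10.0, -10.0)

# Both responses become LESS likely than under the reference, but the
# loser drops faster, so the margin grows and the loss goes down.
loss_displaced = dpo_loss(-20.0, -40.0, -10.0, -10.0)

print(loss_at_reference, loss_displaced)
```

In this example the second configuration has the lower loss even though the winner's log-probability fell from -10 to -20. Nothing in the standard loss penalizes that joint drop, which is the gap a displacement-resistant $f$ is intended to close.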

Problem

Research questions and friction points this paper is trying to address.

DPO

f-divergence

probability displacement

nonconvex

RLHF

Innovation

Methods, ideas, or system contributions that make the work stand out.

DPO-inducing

displacement-resistant

f-divergence