AI Summary
This work addresses two limitations of existing Direct Preference Optimization (DPO)-style algorithms: they assume that the generating function $f$ of the $f$-divergence is convex, and they are susceptible to "probability displacement", in which the probabilities of both the winner and the loser responses approach zero. The paper observes that $f$ need not be convex and introduces a "DPO-inducing" condition that relaxes this assumption. It further proposes a "displacement-resistant" condition on $f$ that is necessary to prevent probability displacement. Building on these conditions, the authors develop a more robust policy optimization framework and design a new loss function, SquaredPO. Theoretical analysis and empirical results indicate that SquaredPO alleviates probability displacement while performing comparably to DPO in practice and offering stronger theoretical guarantees.
Abstract
DPO and related algorithms align language models by directly optimizing the RLHF objective: find a policy that maximizes the Bradley-Terry reward while staying close to a reference policy through a KL divergence penalty. Previous work showed that this approach can be generalized further: the original problem remains tractable even if the KL divergence is replaced by any member of a family of $f$-divergences with a convex generating function $f$. Our first contribution is to show that convexity of $f$ is not essential. Instead, we identify a more general condition, referred to as DPO-inducing, that precisely characterizes when the RLHF problem remains tractable. Our next contribution is to establish a second condition on $f$ that is necessary to prevent probability displacement, a known empirical phenomenon in which the probabilities of both the winner and the loser responses approach zero. We refer to any $f$ that satisfies this condition as displacement-resistant. Finally, we focus on a specific DPO-inducing and displacement-resistant $f$, which leads to our novel SquaredPO loss. Compared to DPO, this new loss offers stronger theoretical guarantees while performing competitively in practice.
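To make the baseline concrete, the following is a minimal sketch of the standard DPO loss for a single preference pair, i.e. the KL-regularised case that the abstract starts from. It is not the SquaredPO loss, whose form is not given in this text. The function names, argument names, and the value `beta=0.1` are illustrative assumptions, and the inputs are taken to be sequence-level log-probabilities under the policy and the reference policy.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (winner, loser) pair.

    beta is the strength of the KL penalty (illustrative default).
    """
    # Implicit rewards are beta * log(pi / pi_ref) for each response;
    # the loss depends only on the difference between the two.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)


# Illustrative numbers only: in case B the winner's log-probability is lower
# than in case A, yet the loss is smaller because the loser's dropped further.
loss_a = dpo_loss(logp_w=-5.0, logp_l=-6.0, ref_logp_w=-5.0, ref_logp_l=-5.0)
loss_b = dpo_loss(logp_w=-8.0, logp_l=-12.0, ref_logp_w=-5.0, ref_logp_l=-5.0)
print(loss_a > loss_b)
```

Because this loss sees only the margin between the two log-ratios, it can decrease while the winner's probability also decreases. That is one way to read the probability displacement phenomenon the abstract describes, and it is the behaviour a displacement-resistant $f$ is meant to rule out.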