🤖 AI Summary
This work addresses the ongoing debate over whether Direct Preference Optimization (DPO) constitutes a reinforcement learning (RL) method, systematically clarifying its theoretical relationship with Reinforcement Learning from Human Feedback (RLHF) algorithms such as PPO. We propose UDRRA, the first unified framework for DPO–RL analysis, which models their intrinsic connection along three dimensions: loss function construction, policy distribution convergence, and the mechanistic roles of key components. We theoretically prove that DPO is an *implicit* RL algorithm: its optimization objective is equivalent to a policy update under an *implicit reward model*, satisfying the core tenets of RL without requiring an explicit reward function. Rigorous derivations and empirical comparisons validate both the convergence guarantees and the objective equivalence. Our study establishes DPO's formal theoretical grounding within the RL paradigm and provides a principled, interpretable, and extensible foundation for preference-based alignment algorithms.
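The "implicit reward model" mentioned above is the standard identity from the original DPO derivation (Rafailov et al., 2023), not something specific to UDRRA: the optimal policy of the KL-regularized RLHF objective induces a reward expressible through the policy log-ratio,

$$
r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x),
$$

and substituting this reward into the Bradley–Terry preference model (where the partition term $\beta \log Z(x)$ cancels between the two responses) yields the DPO classification loss

$$
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x, y_w, y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right].
$$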
📝 Abstract
With the rapid development of Large Language Models (LLMs), numerous Reinforcement Learning from Human Feedback (RLHF) algorithms have been introduced to improve model safety and alignment with human preferences. These algorithms fall into two main frameworks depending on whether they require an explicit reward (or value) function for training: actor-critic-based Proximal Policy Optimization (PPO) and alignment-based Direct Preference Optimization (DPO). The mismatches between DPO and PPO, such as DPO's use of a classification loss driven by human preference data, have raised confusion about whether DPO should be classified as a Reinforcement Learning (RL) algorithm. To resolve these ambiguities, we focus on three key aspects of DPO, RL, and other RLHF algorithms: (1) the construction of the loss function; (2) the target distribution to which the algorithm converges; (3) the impact of key components within the loss function. Specifically, we first establish a unified framework named UDRRA that connects these algorithms through the construction of their loss functions. Next, we uncover their target policy distributions within this framework. Finally, we investigate the critical components of DPO to understand their impact on the convergence rate. Our work provides a deeper understanding of the relationship between DPO, RL, and other RLHF algorithms, offering new insights for improving existing algorithms.
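To make the "classification loss driven by human preference data" concrete, here is a minimal sketch of the standard DPO loss for a single preference pair. This follows the original DPO formulation rather than anything specific to the UDRRA framework; the function name and the scalar, single-pair form (rather than a batched tensor implementation) are illustrative choices.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair.

    logp_w, logp_l:         policy log-probabilities of the chosen (w)
                            and rejected (l) responses
    ref_logp_w, ref_logp_l: reference-policy log-probabilities of the same
                            responses
    beta:                   temperature controlling deviation from the
                            reference policy
    """
    # Implicit rewards: beta times the log-ratio against the reference policy
    r_w = beta * (logp_w - ref_logp_w)
    r_l = beta * (logp_l - ref_logp_l)
    # Binary classification (Bradley-Terry) loss on the reward margin:
    # -log sigmoid(r_w - r_l)
    margin = r_w - r_l
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

No explicit reward model appears anywhere: the policy and reference log-probabilities alone define the implicit rewards, which is the sense in which DPO can be read as RL with an implicit reward function.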