🤖 AI Summary
DPO suffers from model misspecification when the policy class cannot represent the true reward function that generates the preferences, leading to preference order reversal, degradation of policy reward, and sensitivity to the preference data distribution. To address this, the authors analyze the local behavior of two-stage RLHF from a geometric perspective, relating it to a natural gradient step in policy space, and propose AuxDPO: a method that introduces auxiliary variables into the DPO loss to move toward the RLHF solution in a principled manner. Its core idea is to enrich the implicit reward functions the loss can express, thereby alleviating DPO's strong dependence on the expressive capacity of the policy class. Experiments on didactic bandit settings and large language model alignment tasks demonstrate that AuxDPO consistently outperforms standard DPO and exhibits superior robustness under preference noise, empirically validating its effectiveness in mitigating model misspecification.
📝 Abstract
Direct alignment algorithms such as Direct Preference Optimization (DPO) fine-tune models on preference data using only supervised learning, in place of two-stage reinforcement learning from human feedback (RLHF). We show that DPO encodes a statistical estimation problem over reward functions induced by a parametric policy class. When the true reward function that generates preferences cannot be realized via the policy class, DPO becomes misspecified, resulting in failure modes such as preference order reversal, worsening of policy reward, and high sensitivity to the input preference data distribution. In contrast, we study the local behavior of two-stage RLHF for a parametric class and relate it to a natural gradient step in policy space. Our fine-grained geometric characterization allows us to propose AuxDPO, which introduces additional auxiliary variables in the DPO loss function to help move towards the RLHF solution in a principled manner and mitigate the misspecification in DPO. We empirically demonstrate the superior performance of AuxDPO on didactic bandit settings as well as LLM alignment tasks.
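To make the setup concrete, the standard DPO loss on a single preference pair can be sketched as below. The `aux` argument is purely illustrative of the idea of adding an extra degree of freedom to the loss; the actual AuxDPO parameterization of the auxiliary variables is defined in the paper, not here.

```python
import numpy as np

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + np.exp(-z))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1, aux=0.0):
    """Standard per-pair DPO loss: -log sigma(beta * margin), where the
    margin compares policy vs. reference log-probabilities of the
    preferred (w) and dispreferred (l) responses.

    `aux` is a hypothetical scalar offset, included only to illustrate
    the notion of an auxiliary variable in the loss; it is NOT the
    AuxDPO construction from the paper.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l)) + aux
    return -np.log(sigmoid(margin))

# The loss is smaller when the policy already prefers the winner...
favors_winner = dpo_loss(-1.0, -2.0, -1.5, -1.5)
favors_loser = dpo_loss(-2.0, -1.0, -1.5, -1.5)
# ...and equals log(2) when the margin is zero (policy == reference).
at_reference = dpo_loss(-1.5, -1.5, -1.5, -1.5)
```

Minimizing this loss pushes the implicit reward margin `beta * (log pi/pi_ref)` of the winner above that of the loser; the paper's point is that this implicit reward family is constrained by the policy class, which is what the auxiliary variables relax.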