Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO

📅 2025-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper investigates the root causes of the performance difference between RLHF and DPO under representation mismatch, disentangling the gap into an *explicit representation gap* (under exact optimization) and an *implicit representation gap* (under finite-sample learning). Through theoretical analysis grounded in representation learning and statistical learning theory, the authors derive the first necessary and sufficient conditions for a performance reversal between RLHF and DPO. They prove that online DPO can strictly outperform both RLHF and standard DPO when the reward and policy model classes are isomorphic but both misspecified, and they establish RLHF's sample-efficiency advantage in implicitly sparse-reward settings. Together, these results form the first unified theoretical framework characterizing the precise applicability boundaries of RLHF and DPO, offering verifiable theoretical foundations and actionable guidance for selecting alignment methods for large language models.
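
For reference, these are the standard objectives the analysis compares (standard formulations from the RLHF/DPO literature, reproduced here as background rather than taken from this paper; r_phi is the learned reward model, pi_theta the policy, pi_ref the frozen reference policy, and beta the KL coefficient):

```latex
% Two-stage RLHF: fit a reward model r_phi on preference data, then solve the
% KL-regularized policy optimization problem against it.
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\!\big[ r_\phi(x, y) \big]
\;-\; \beta\, \mathbb{E}_{x \sim \mathcal{D}}\!\big[ \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big) \big]

% One-stage DPO: optimize the policy directly on preference pairs
% (y_w preferred to y_l), with the reward reparameterized through the policy.
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta)
= -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
\left[
\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
\;-\;
\beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)
\right]
```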

📝 Abstract
We present a fine-grained theoretical analysis of the performance gap between reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) under a representation gap. Our study decomposes this gap into two sources: an explicit representation gap under exact optimization and an implicit representation gap under finite samples. In the exact optimization setting, we characterize how the relative capacities of the reward and policy model classes influence the final policy qualities. We show that RLHF, DPO, or online DPO can outperform one another depending on the type of model mis-specifications. Notably, online DPO can outperform both RLHF and standard DPO when the reward and policy model classes are isomorphic and both mis-specified. In the approximate optimization setting, we provide a concrete construction where the ground-truth reward is implicitly sparse and show that RLHF requires significantly fewer samples than DPO to recover an effective reward model -- highlighting a statistical advantage of two-stage learning. Together, these results provide a comprehensive understanding of the performance gap between RLHF and DPO under various settings, and offer practical insights into when each method is preferred.
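
As a concrete illustration of the DPO side of this comparison, below is a minimal sketch of the standard DPO loss on a batch of preference pairs (function and variable names are hypothetical, not from the paper; online DPO differs only in that the preference pairs are regenerated from the current policy during training):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for a batch of preference pairs.

    Each argument is a 1-D tensor of summed token log-probabilities of the
    chosen (w) or rejected (l) response under the policy or frozen reference.
    """
    # Implicit rewards: beta * log(pi_theta / pi_ref) for each response.
    chosen = beta * (policy_logp_w - ref_logp_w)
    rejected = beta * (policy_logp_l - ref_logp_l)
    # Bradley-Terry logistic loss on the implicit reward margin.
    return -F.logsigmoid(chosen - rejected).mean()

# Example usage with random log-probabilities for a batch of 4 pairs.
lp = torch.randn(4)
loss = dpo_loss(lp, lp - 0.5, torch.randn(4), torch.randn(4))
```
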
Problem

Research questions and friction points this paper is trying to address.

Analyzes performance gap between RLHF and DPO methods
Examines impact of model mis-specifications on policy quality
Compares sample efficiency of RLHF versus DPO
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzes the RLHF and DPO performance gap theoretically under exact and approximate optimization
Characterizes how the relative capacities of the reward and policy model classes affect final policy quality
Establishes RLHF's sample-efficiency advantage under implicitly sparse rewards (see the reward-model sketch below)
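
To make the two-stage structure behind RLHF's claimed sample-efficiency advantage concrete, here is a minimal sketch of its first stage: fitting a Bradley-Terry reward model on preference pairs before any policy optimization (standard formulation under toy assumptions; the model and names are hypothetical, not the paper's construction):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy scalar reward model over fixed-size prompt+response features."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, features):
        return self.score(features).squeeze(-1)

def reward_model_loss(rm, feat_w, feat_l):
    # Bradley-Terry: the chosen response should score higher than the rejected one.
    return -F.logsigmoid(rm(feat_w) - rm(feat_l)).mean()

# Stage 1 of RLHF: fit the reward model on preference data. Stage 2 would then
# run KL-regularized policy optimization against the learned reward.
rm = RewardModel(dim=16)
opt = torch.optim.Adam(rm.parameters(), lr=1e-3)
feat_w, feat_l = torch.randn(32, 16), torch.randn(32, 16)
loss = reward_model_loss(rm, feat_w, feat_l)
loss.backward()
opt.step()
```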