Doubly Robust Alignment for Large Language Models

📅 2025-06-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the poor robustness of Reinforcement Learning from Human Feedback (RLHF) to model misspecification, which can arise from an incorrect preference model (e.g., Bradley–Terry), reference policy, or reward function. The authors propose the first doubly robust preference optimization algorithm, which guarantees consistent estimation if *either* the preference model *or* the reference policy is correctly specified, significantly improving tolerance to misspecification. The method integrates doubly robust estimation with preference modeling, eliminating the need for accurate reward modeling or explicit policy constraints. The paper establishes asymptotic optimality through theoretical analysis and demonstrates superior empirical performance over state-of-the-art methods across multiple benchmarks. The implementation is publicly available.
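The double-robustness property described above (consistency when either of two nuisance models is correct) can be illustrated in miniature with the classical augmented inverse-propensity-weighted (AIPW) estimator from causal inference. This is a hedged sketch of the general principle only, not the paper's DRPO objective; all variable names and the simulated data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: treatment a assigned with propensity e(x),
# outcome y depends on x and a. Target: E[y under a=1] = 1.5 here.
n = 50_000
x = rng.normal(size=n)
true_e = 1.0 / (1.0 + np.exp(-x))            # true propensity P(a=1 | x)
a = rng.binomial(1, true_e)
y = 2.0 * x + 1.5 * a + rng.normal(size=n)   # E[y | a=1] averages to 1.5

# Two nuisance models, one deliberately misspecified:
e_hat = true_e            # propensity model: correct
m_hat = np.zeros(n)       # outcome model for E[y | a=1, x]: wrong (constant 0)

# Doubly robust (AIPW) estimate of E[y under a=1]: the correction term
# a / e_hat * (y - m_hat) repairs the bias of the wrong outcome model.
dr = np.mean(m_hat + a / e_hat * (y - m_hat))

# Plug-in estimate relying only on the wrong outcome model:
plugin = np.mean(m_hat)

print(f"doubly robust: {dr:.3f}   plug-in: {plugin:.3f}   target: 1.500")
```

Swapping the roles (correct outcome model, wrong propensity) leaves the AIPW estimate consistent as well; the paper's contribution is an analogous guarantee for preference optimization, with the preference model and the reference policy playing the roles of the two nuisance components.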

📝 Abstract
This paper studies reinforcement learning from human feedback (RLHF) for aligning large language models with human preferences. While RLHF has demonstrated promising results, many algorithms are highly sensitive to misspecifications in the underlying preference model (e.g., the Bradley-Terry model), the reference policy, or the reward function, resulting in undesirable fine-tuning. To address model misspecification, we propose a doubly robust preference optimization algorithm that remains consistent when either the preference model or the reference policy is correctly specified (without requiring both). Our proposal demonstrates superior and more robust performance than state-of-the-art algorithms, both in theory and in practice. The code is available at https://github.com/DRPO4LLM/DRPO4LLM
Problem

Research questions and friction points this paper is trying to address.

Addresses the sensitivity of RLHF to misspecified preference models, reference policies, and reward functions
Proposes a doubly robust alignment algorithm that remains consistent without requiring both models to be correctly specified
Improves robustness and performance over state-of-the-art methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Doubly robust preference optimization algorithm
Consistency under a single correct specification (preference model or reference policy)
Superior, more robust performance both in theory and in practice