Doubly Robust Alignment for Large Language Models

📅 2025-06-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the poor robustness of Reinforcement Learning from Human Feedback (RLHF) to model misspecification, which can arise from an incorrect preference model (e.g., Bradley–Terry), reference policy, or reward function. The authors propose the first doubly robust preference optimization algorithm, which guarantees consistent estimation if *either* the preference model *or* the reference policy is correctly specified, significantly improving tolerance to misspecification. The method integrates doubly robust estimation with preference modeling, eliminating the need for accurate reward modeling or explicit policy constraints. The paper establishes asymptotic optimality through theoretical analysis and demonstrates superior empirical performance over state-of-the-art methods across multiple benchmarks. The implementation is publicly available.
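The double-robustness property described above (consistency when either of two nuisance models is correct) can be illustrated in miniature with the classical augmented inverse-propensity-weighted (AIPW) estimator from causal inference. This is a hedged sketch of the general principle only, not the paper's DRPO objective; all variable names and the simulated data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: treatment a assigned with propensity e(x),
# outcome y depends on x and a. Target: E[y under a=1] = 1.5 here.
n = 50_000
x = rng.normal(size=n)
true_e = 1.0 / (1.0 + np.exp(-x))            # true propensity P(a=1 | x)
a = rng.binomial(1, true_e)
y = 2.0 * x + 1.5 * a + rng.normal(size=n)   # E[y | a=1] averages to 1.5

# Two nuisance models, one deliberately misspecified:
e_hat = true_e            # propensity model: correct
m_hat = np.zeros(n)       # outcome model for E[y | a=1, x]: wrong (constant 0)

# Doubly robust (AIPW) estimate of E[y under a=1]: the correction term
# a / e_hat * (y - m_hat) repairs the bias of the wrong outcome model.
dr = np.mean(m_hat + a / e_hat * (y - m_hat))

# Plug-in estimate relying only on the wrong outcome model:
plugin = np.mean(m_hat)

print(f"doubly robust: {dr:.3f}   plug-in: {plugin:.3f}   target: 1.500")
```

Swapping the roles (correct outcome model, wrong propensity) leaves the AIPW estimate consistent as well; the paper's contribution is an analogous guarantee for preference optimization, with the preference model and the reference policy playing the roles of the two nuisance components.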

📝 Abstract
This paper studies reinforcement learning from human feedback (RLHF) for aligning large language models with human preferences. While RLHF has demonstrated promising results, many algorithms are highly sensitive to misspecifications in the underlying preference model (e.g., the Bradley-Terry model), the reference policy, or the reward function, resulting in undesirable fine-tuning. To address model misspecification, we propose a doubly robust preference optimization algorithm that remains consistent when either the preference model or the reference policy is correctly specified (without requiring both). Our proposal demonstrates superior and more robust performance than state-of-the-art algorithms, both in theory and in practice. The code is available at https://github.com/DRPO4LLM/DRPO4LLM
Problem

Research questions and friction points this paper is trying to address.

Addresses the sensitivity of RLHF to misspecified preference models, reference policies, and reward functions
Proposes a doubly robust alignment algorithm that remains consistent without requiring both models to be correctly specified
Improves robustness and performance over state-of-the-art methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Doubly robust preference optimization algorithm
Consistency under a single correct specification (preference model or reference policy)
Superior, more robust performance both in theory and in practice