Robust Reinforcement Learning from Human Feedback for Large Language Models Fine-Tuning

📅 2025-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing RLHF methods rely on the Bradley–Terry reward model, whose strong parametric assumptions fail to capture the complexity and noise inherent in real human preferences, leading to reward misspecification and policy degradation. This work proposes a robust RLHF framework addressing these limitations. First, it theoretically unifies variance reduction for both reward estimation and policy gradient estimation, yielding a significantly tightened regret bound. Second, it drops the Bradley–Terry assumption in favor of robust statistical estimation with an explicit bias–variance trade-off analysis, enabling effective modeling of heterogeneous preference data and label noise. Third, empirical evaluation on the Anthropic Helpful and Harmless benchmark shows that 77–81% of the framework's responses are preferred over those of baseline methods, improving both alignment performance and generalization stability. The approach thus advances RLHF by strengthening both theoretical rigor and practical robustness to real-world preference data.
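As context for the misspecification the summary describes, the standard Bradley–Terry objective that the paper departs from can be sketched as below. This is a minimal illustration under my own naming, not the paper's code: the reward model is trained so that the probability a response is preferred is a sigmoid of the reward gap.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood of one preference pair under Bradley-Terry:
    P(chosen preferred over rejected) = sigmoid(r_chosen - r_rejected)."""
    return -math.log(sigmoid(r_chosen - r_rejected))
```

When the two rewards are equal, the model predicts a chance-level preference and the loss is ln 2; a larger reward margin for the chosen response lowers the loss. The paper's point is that noisy, heterogeneous human labels need not follow this parametric link at all, which is exactly the misspecification its robust estimator targets.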

📝 Abstract
Reinforcement learning from human feedback (RLHF) has emerged as a key technique for aligning the output of large language models (LLMs) with human preferences. To learn the reward function, most existing RLHF algorithms use the Bradley-Terry model, which relies on assumptions about human preferences that may not reflect the complexity and variability of real-world judgments. In this paper, we propose a robust algorithm to enhance the performance of existing approaches under such reward model misspecifications. Theoretically, our algorithm reduces the variance of reward and policy estimators, leading to improved regret bounds. Empirical evaluations on LLM benchmark datasets demonstrate that the proposed algorithm consistently outperforms existing methods, with 77-81% of responses being favored over baselines on the Anthropic Helpful and Harmless dataset.
Problem

Research questions and friction points this paper is trying to address.

Enhance RLHF robustness for LLM fine-tuning
Address reward model misspecifications in RLHF
Improve variance reduction in reward estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Robust algorithm for reward model misspecifications
Reduces variance of reward and policy estimators
Outperforms baselines on Anthropic Helpful and Harmless dataset
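The variance-reduction idea in the bullets above can be illustrated with the classic baseline trick for score-function (policy) gradients: subtracting a baseline from the reward leaves the gradient estimator unbiased but can sharply shrink its variance. This is a toy two-armed-bandit sketch of the general principle under my own setup, not the paper's actual estimator.

```python
import random

def grad_estimates(n: int, baseline: float, p: float = 0.5) -> list[float]:
    """Score-function gradient estimates for a 2-armed bandit policy that
    picks arm 0 with probability p. Each sample is
    (reward - baseline) * d log pi(arm) / d p."""
    samples = []
    for _ in range(n):
        if random.random() < p:
            reward, score = 1.0, 1.0 / p          # d log(p)/dp
        else:
            reward, score = 0.5, -1.0 / (1.0 - p)  # d log(1-p)/dp
        samples.append((reward - baseline) * score)
    return samples

def variance(xs: list[float]) -> float:
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)
```

With baseline 0 the estimates swing between +2 and -1; with the mean reward 0.75 as baseline, every sample collapses to the same value, so the variance drops to zero while the expected gradient is unchanged. The paper's contribution, per the summary, is a unified analysis of this kind of variance reduction for both reward and policy estimation.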
Kai Ye — Department of Statistics, LSE
Hongyi Zhou — Karlsruhe Institute of Technology
Jin Zhu — Department of Statistics, LSE
Francesco Quinzan — University of Oxford
Chengchun Shi — Department of Statistics, LSE