🤖 AI Summary
Existing RLHF methods suffer significant performance degradation under prompt distribution shift (out-of-distribution, OOD), exhibiting insufficient robustness. This work proposes an RLHF framework explicitly designed for distribution shift, systematically integrating distributionally robust optimization (DRO) into both reward modeling and policy optimization. First, it constructs a robust reward model to mitigate the mismatch between the preference data distribution and the downstream prompt distribution. Second, it designs a robust direct preference optimization (DPO) algorithm that supports minibatch updates and enjoys theoretical convergence guarantees. Under an OOD evaluation paradigm, the approach substantially improves reward model accuracy, particularly on reasoning tasks, and enhances policy generalization across shifted distributions. Empirically, it achieves consistent gains in alignment fidelity and task performance under distribution shift, improving the robustness of RLHF for practical deployment.
📝 Abstract
Reinforcement learning from human feedback (RLHF) has become one of the main methods for fine-tuning large language models (LLMs). However, existing RLHF methods are not robust: their performance deteriorates if the downstream task differs significantly from the preference dataset used in fine-tuning. To mitigate this problem, we introduce a distributionally robust RLHF framework for fine-tuning LLMs. In particular, our goal is to ensure that a fine-tuned model retains its performance even when the distribution of prompts differs significantly from the distribution encountered during fine-tuning. We formulate distributionally robust optimization (DRO) versions of two popular fine-tuning methods -- (1) reward-based RLHF and (2) reward-free DPO (direct preference optimization). We propose minibatch gradient-descent-based algorithms for both and theoretically prove convergence guarantees. Subsequently, we evaluate our algorithms on an out-of-distribution (OOD) task by first training the model on the Unified-Feedback dataset and then evaluating its performance on two different datasets. The experimental results show that our robust training improves the accuracy of the learned reward models on average, and markedly on some tasks, such as reasoning. Furthermore, we show that the robust versions of the policy optimization methods similarly improve performance on OOD tasks.
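To make the idea concrete, here is a minimal sketch of how a distributionally robust DPO-style minibatch loss could look. The paper's exact uncertainty set, dual variables, and update rule are not specified in this abstract, so the snippet below assumes a generic KL-ball DRO formulation: instead of averaging per-example DPO losses, it uses the log-sum-exp dual, which exponentially upweights hard (worst-case) examples. The function names, the `beta` and `lam` parameters, and the toy data are all illustrative, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dpo_losses(logratio_chosen, logratio_rejected, beta=0.1):
    """Per-example DPO loss: -log sigma(beta * (chosen - rejected) log-ratio margin).

    Inputs are log pi_theta(y|x) - log pi_ref(y|x) for the chosen and
    rejected responses of each preference pair in the minibatch.
    """
    margin = beta * (logratio_chosen - logratio_rejected)
    return -np.log(sigmoid(margin))

def kl_dro_loss(losses, lam=0.5):
    """Dual (log-sum-exp) form of the worst-case expected loss over a
    KL ball around the empirical minibatch distribution, up to an
    additive lam * radius constant:

        lam * log( mean( exp(loss / lam) ) )

    As lam -> infinity this recovers the plain average; as lam -> 0 it
    approaches the maximum per-example loss, i.e. full worst-case.
    """
    return lam * np.log(np.mean(np.exp(losses / lam)))

# Toy minibatch of log-probability ratios (policy vs. reference).
rng = np.random.default_rng(0)
chosen = rng.normal(0.5, 1.0, size=8)
rejected = rng.normal(-0.5, 1.0, size=8)

per_example = dpo_losses(chosen, rejected)
robust = kl_dro_loss(per_example, lam=0.5)
```

By Jensen's inequality the robust objective always upper-bounds the plain average DPO loss, so minimizing it trades some average-case performance for protection against prompts the current policy handles poorly; in practice the scalar loss would be computed on differentiable log-ratios and backpropagated as usual.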