Generalisation of RLHF under Reward Shift and Clipped KL Regularisation

📅 2026-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the generalisation properties of reinforcement learning from human feedback (RLHF) under reward model misspecification and when the KL regularisation term is estimated from samples and then clipped. It presents the first theoretical analysis that explicitly incorporates reward shift and KL clipping error into a generalisation framework for RLHF, deriving a bound that depends on the distributions of prompts, rollouts, and preference data. By treating stochastic-gradient-descent training as an Ornstein-Uhlenbeck process and combining generalisation bounds with an error decomposition, the study shows that the generalisation error is jointly governed by sampling error, reward shift, and KL clipping error. Building on these insights, the paper proposes an optimal KL clipping threshold and a budget allocation strategy across prompts, rollouts, and preference data, offering principled guidance for improving the robustness of RLHF systems.
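
Schematically, the decomposition described above can be pictured as a bound of the shape sketched below. The symbols (sample sizes n_prompt and n_rollout, the shift term, and the clipping term at threshold c) are illustrative placeholders, not the paper's exact statement.

```latex
% Illustrative shape only; constants, norms, and rate exponents are placeholders.
\[
\underbrace{\mathcal{E}_{\mathrm{gen}}(\pi)}_{\text{generalisation error}}
\;\lesssim\;
\underbrace{\mathcal{O}\!\left(\sqrt{\tfrac{\log(1/\delta)}{n_{\mathrm{prompt}}}}
  + \sqrt{\tfrac{\log(1/\delta)}{n_{\mathrm{rollout}}}}\right)}_{\text{sampling error}}
\;+\;
\underbrace{\varepsilon_{\mathrm{shift}}}_{\text{reward shift}}
\;+\;
\underbrace{\varepsilon_{\mathrm{clip}}(c)}_{\text{KL clipping (threshold } c\text{)}}
\]
```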

📝 Abstract
Alignment and adaptation in large language models rely heavily on reinforcement learning from human feedback (RLHF); yet theoretical understanding of its generalisability remains immature, especially when the learned reward may shift and the KL control is estimated and clipped. To address this, we develop a generalisation theory for RLHF that explicitly accounts for (1) \emph{reward shift}: reward models are trained on preference data from earlier or mixed behaviour policies, while RLHF optimises the current policy on its own rollouts; and (2) \emph{clipped KL regularisation}: the KL regulariser is estimated from sampled log-probability ratios and then clipped for stabilisation, introducing an additional error into RLHF. We present generalisation bounds for RLHF, suggesting that the generalisation error stems from a sampling error over prompts and rollouts, a reward shift error, and a KL clipping error. We also discuss two special cases: (1) initialising RLHF parameters with a uniform prior over a finite space, and (2) training RLHF by stochastic gradient descent, modelled as an Ornstein-Uhlenbeck process. The theory yields practical implications for (1) the optimal KL clipping threshold, and (2) budget allocation across prompts, rollouts, and preference data.
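
As a concrete illustration of the "estimated and clipped" KL regulariser described in the abstract, the sketch below computes a per-token KL estimate from sampled log-probability ratios and clips it at a threshold. The function name, the choice of the non-negative "k3" estimator, and the placeholder threshold kl_clip are assumptions made for illustration, not the paper's implementation; its derived optimal threshold and estimator may differ.

```python
import torch

def clipped_kl_penalty(logp_current: torch.Tensor,
                       logp_reference: torch.Tensor,
                       kl_clip: float = 10.0) -> torch.Tensor:
    """Per-token KL(current || reference) estimated from sampled
    log-probability ratios, then clipped for stabilisation."""
    # Log-probability ratio of the sampled rollout tokens under the
    # policy being trained vs. the frozen reference policy.
    log_ratio = logp_current - logp_reference

    # Non-negative "k3" estimator: for tokens sampled from the current
    # policy, E[exp(-log_ratio) - 1 + log_ratio] = KL(current || ref).
    kl_est = torch.exp(-log_ratio) - 1.0 + log_ratio

    # Clipping keeps a few extreme ratios from dominating the
    # regulariser, at the cost of a bounded bias -- the "KL clipping
    # error" the abstract refers to.
    return torch.clamp(kl_est, max=kl_clip)
```

In a typical RLHF objective, such a clipped estimate would be scaled by a KL coefficient and subtracted from the learned reward on each rollout; the paper's analysis concerns how the bias introduced by the clipping step propagates into the generalisation error.
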
Problem

Research questions and friction points this paper is trying to address.

RLHF
reward shift
clipped KL regularisation
generalisation
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

reward shift
clipped KL regularisation
generalisation bounds
RLHF
preference data