🤖 AI Summary
This paper investigates the theoretical foundations and algorithmic design of KL-regularized reinforcement learning from human feedback (RLHF) under $\varepsilon$-local differential privacy ($\varepsilon$-LDP). Addressing both offline and online settings, it establishes the first unified theoretical framework for KL-regularized RLHF under LDP. For the offline setting, the authors propose a pessimistic algorithm achieving an optimal suboptimality gap of $\tilde{O}(1/[(e^\varepsilon-1)^2 n])$ under single-policy concentrability. For the online setting, they design an optimistic algorithm attaining a logarithmic regret bound of $O(d_{\mathcal{F}} \log(N_{\mathcal{F}} \cdot T)/(e^\varepsilon-1)^2)$ under an eluder-dimension assumption on the reward function class $\mathcal{F}$, with matching lower bounds provided in both cases. Additionally, the paper delivers the first complete analysis of non-private online KL-regularized RLHF as a foundational baseline. The offline algorithm is implemented to empirically validate the theoretical results, and the code is open-sourced.
📝 Abstract
In this paper, we study the offline and online settings of reinforcement learning from human feedback (RLHF) with KL-regularization -- a widely used objective function in large language model alignment -- under the $\varepsilon$-local differential privacy ($\varepsilon$-LDP) model on the human preference labels. In the offline setting, we design an algorithm based on the principle of pessimism and derive a new suboptimality gap of $\tilde{O}(1/[(e^\varepsilon-1)^2 n])$ on the KL-regularized objective under single-policy concentrability, where $n$ is the sample size. We also prove its optimality by providing a matching lower bound.
In the online setting, we provide the first theoretical investigation of KL-regularized RLHF with LDP. We design an optimism-based algorithm and derive a logarithmic regret bound of $O(d_{\mathcal{F}}\log(N_{\mathcal{F}}\cdot T)/(e^\varepsilon-1)^2)$, where $T$ is the total number of time steps, $N_{\mathcal{F}}$ is the cardinality of the reward function class $\mathcal{F}$, and $d_{\mathcal{F}}$ is a variant of the eluder dimension tailored to RLHF. As a by-product of our analysis, our results also imply the first analysis of online KL-regularized RLHF without privacy. We implement our algorithm in the offline setting to verify our theoretical results and release our open-source code at: https://github.com/rushil-thareja/PPKL-RLHF-Official.
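The abstract does not specify how the preference labels are privatized, but the canonical way to achieve $\varepsilon$-LDP on a single binary label is randomized response. The minimal Python sketch below (the names `privatize_label` and `debias_label` are illustrative, not taken from the paper's code) shows, under that assumption, how a label can be randomized and then debiased; the $(e^\varepsilon-1)^{-1}$ factor introduced by debiasing is one natural way a $(e^\varepsilon-1)^{-2}$ variance blow-up, as in the bounds above, can arise.

```python
import numpy as np

def privatize_label(y: int, eps: float, rng: np.random.Generator) -> int:
    """Randomized response: keep the binary preference label y with
    probability e^eps / (e^eps + 1), otherwise flip it.
    This mechanism satisfies eps-LDP for a single binary label."""
    p_keep = np.exp(eps) / (np.exp(eps) + 1.0)
    return y if rng.random() < p_keep else 1 - y

def debias_label(z: int, eps: float) -> float:
    """Unbiased estimate of the true label from a privatized label z.
    Since E[z] = 1/(e^eps+1) + y * (e^eps-1)/(e^eps+1), invert the affine map.
    The 1/(e^eps - 1) factor here inflates the noise, which is one way a
    (e^eps - 1)^{-2} dependence can show up in variance-type bounds."""
    return (z * (np.exp(eps) + 1.0) - 1.0) / (np.exp(eps) - 1.0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    eps, y = 1.0, 1
    private = [privatize_label(y, eps, rng) for _ in range(10_000)]
    estimate = np.mean([debias_label(z, eps) for z in private])
    print(f"debiased mean ≈ {estimate:.3f} (true label = {y})")
```

Averaged over many samples, the debiased labels concentrate around the true label, which is the kind of estimator a learner can feed into a pessimistic (offline) or optimistic (online) reward-fitting step.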