🤖 AI Summary
This paper investigates the theoretical foundations and algorithmic design of KL-regularized reinforcement learning from human feedback (RLHF) under $\varepsilon$-local differential privacy ($\varepsilon$-LDP). Addressing both offline and online settings, it establishes the first unified theoretical framework for KL-regularized RLHF under LDP. For the offline setting, the authors propose a pessimistic algorithm achieving an optimal suboptimality gap of $\tilde{O}(1/[(e^\varepsilon-1)^2 n])$ under single-policy concentrability. For the online setting, they design an optimistic algorithm attaining a logarithmic regret bound of $O(d_{\mathcal{F}} \log(N_{\mathcal{F}} \cdot T)/(e^\varepsilon-1)^2)$ under an eluder-dimension assumption on the reward function class $\mathcal{F}$, with matching lower bounds provided in both cases. Additionally, the paper delivers the first complete analysis of non-private online KL-regularized RLHF as a foundational baseline. The offline algorithm is implemented to empirically validate the theoretical results, and the code is open-sourced.
📝 Abstract
In this paper, we study the offline and online settings of reinforcement learning from human feedback (RLHF) with KL-regularization -- a widely used objective function in large language model alignment -- under the $\varepsilon$-local differential privacy ($\varepsilon$-LDP) model on the human preference labels. In the offline setting, we design an algorithm based on the principle of pessimism and derive a new suboptimality gap of $\tilde{O}(1/[(e^\varepsilon-1)^2 n])$ on the KL-regularized objective under single-policy concentrability, where $n$ is the sample size. We also prove its optimality by providing a matching lower bound.
In the online setting, we provide the first theoretical investigation of KL-regularized RLHF with LDP. We design an optimism-based algorithm and derive a logarithmic regret bound of $O(d_{\mathcal{F}}\log(N_{\mathcal{F}}\cdot T)/(e^\varepsilon-1)^2)$, where $T$ is the total number of time steps, $N_{\mathcal{F}}$ is the cardinality of the reward function class $\mathcal{F}$, and $d_{\mathcal{F}}$ is a variant of the eluder dimension tailored to RLHF. As a by-product of our analysis, our results also imply the first analysis of online KL-regularized RLHF without privacy. We implement our algorithm in the offline setting to verify our theoretical results and release our open-source code at: https://github.com/rushil-thareja/PPKL-RLHF-Official.
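The abstract does not specify how the preference labels are privatized, but the canonical way to achieve $\varepsilon$-LDP on a single binary label is randomized response. The minimal Python sketch below (the names `privatize_label` and `debias_label` are illustrative, not taken from the paper's code) shows, under that assumption, how a label can be randomized and then debiased; the $(e^\varepsilon-1)^{-1}$ factor introduced by debiasing is one natural way a $(e^\varepsilon-1)^{-2}$ variance blow-up, as in the bounds above, can arise.

```python
import numpy as np

def privatize_label(y: int, eps: float, rng: np.random.Generator) -> int:
    """Randomized response: keep the binary preference label y with
    probability e^eps / (e^eps + 1), otherwise flip it.
    This mechanism satisfies eps-LDP for a single binary label."""
    p_keep = np.exp(eps) / (np.exp(eps) + 1.0)
    return y if rng.random() < p_keep else 1 - y

def debias_label(z: int, eps: float) -> float:
    """Unbiased estimate of the true label from a privatized label z.
    Since E[z] = 1/(e^eps+1) + y * (e^eps-1)/(e^eps+1), invert the affine map.
    The 1/(e^eps - 1) factor here inflates the noise, which is one way a
    (e^eps - 1)^{-2} dependence can show up in variance-type bounds."""
    return (z * (np.exp(eps) + 1.0) - 1.0) / (np.exp(eps) - 1.0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    eps, y = 1.0, 1
    private = [privatize_label(y, eps, rng) for _ in range(10_000)]
    estimate = np.mean([debias_label(z, eps) for z in private])
    print(f"debiased mean ≈ {estimate:.3f} (true label = {y})")
```

Averaged over many samples, the debiased labels concentrate around the true label, which is the kind of estimator a learner can feed into a pessimistic (offline) or optimistic (online) reward-fitting step.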