KL-regularization Itself is Differentially Private in Bandits and RLHF

📅 2025-05-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses three sequential decision-making problems—multi-armed bandits, linear contextual bandits, and reinforcement learning from human feedback (RLHF)—by introducing a novel differentially private (DP) paradigm that requires no explicit noise injection. The core method leverages the inherent smoothing effect of KL-divergence regularization in policy optimization, and provides the first rigorous theoretical proof that appropriately scaled KL regularization naturally satisfies ε-differential privacy, with the privacy budget ε precisely controllable via the regularization coefficient. This “free privacy” mechanism circumvents the performance degradation typically induced by conventional DP approaches reliant on additive noise, preserving both policy convergence guarantees and practical utility while ensuring per-sample privacy. Empirical evaluation on offline decision-making tasks validates its effectiveness.
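To make the mechanism concrete, here is a minimal sketch for the multi-armed bandit case, not the paper's actual algorithm. A KL-regularized objective against a uniform reference policy has a closed-form softmax maximizer over empirical mean rewards, and releasing one action sampled from it resembles the exponential mechanism: changing a single reward sample (bounded in [0, 1]) shifts one arm's mean by at most 1/n_a, and the resulting privacy guarantee tightens as the regularization coefficient `eta` grows. Function and variable names here are illustrative, not from the paper.

```python
import math
import random

def kl_regularized_policy(rewards_per_arm, eta):
    """Closed-form maximizer of  E_pi[r_hat] - eta * KL(pi || uniform):
    a softmax over empirical mean rewards, pi(a) ∝ exp(r_hat(a) / eta)."""
    r_hat = [sum(r) / len(r) for r in rewards_per_arm]
    m = max(x / eta for x in r_hat)              # subtract max for numerical stability
    weights = [math.exp(x / eta - m) for x in r_hat]
    total = sum(weights)
    return [w / total for w in weights]

# Toy offline data: per-arm reward samples, each bounded in [0, 1].
data = [[0.9, 0.8, 0.85], [0.2, 0.3], [0.5]]

policy = kl_regularized_policy(data, eta=0.5)

# The single sampled action is the released (private) output;
# no explicit noise is injected anywhere.
action = random.choices(range(len(policy)), weights=policy)[0]
```

Note the trade-off the paper formalizes: a larger `eta` flattens the softmax toward uniform (stronger privacy, lower utility), while a smaller `eta` concentrates mass on the empirically best arm.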

📝 Abstract
Differential Privacy (DP) provides a rigorous framework for privacy, ensuring the outputs of data-driven algorithms remain statistically indistinguishable across datasets that differ in a single entry. While guaranteeing DP generally requires explicitly injecting noise either to the algorithm itself or to its outputs, the intrinsic randomness of existing algorithms presents an opportunity to achieve DP "for free". In this work, we explore the role of regularization in achieving DP across three different decision-making problems: multi-armed bandits, linear contextual bandits, and reinforcement learning from human feedback (RLHF), in offline data settings. We show that adding KL-regularization to the learning objective (a common approach in optimization algorithms) makes the action sampled from the resulting stochastic policy itself differentially private. This offers a new route to privacy guarantees without additional noise injection, while also preserving the inherent advantage of regularization in enhancing performance.
Problem

Research questions and friction points this paper is trying to address.

Achieving differential privacy without noise injection
Exploring KL-regularization in bandits and RLHF
Privacy-preserving stochastic policy via regularization
Innovation

Methods, ideas, or system contributions that make the work stand out.

KL-regularization achieves DP without noise
DP via stochastic policy sampling in bandits
Privacy-preserving RLHF through KL-regularized objectives
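In the RLHF setting, the same idea applies with a non-uniform reference: the KL-regularized objective against a reference policy has the well-known closed-form tilted solution π(y|x) ∝ π_ref(y|x) · exp(r(x, y)/β), so the sampled response is itself the stochastic, privacy-bearing output. The sketch below is a hedged toy illustration over a small discrete response set; the policy values, scores, and function name are invented for the example, not taken from the paper.

```python
import math

def kl_regularized_rlhf_policy(ref_probs, rewards, beta):
    """Closed-form solution of  max_pi E_pi[r] - beta * KL(pi || pi_ref):
    pi(y|x) ∝ pi_ref(y|x) * exp(r(x, y) / beta)."""
    weights = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical toy setup: three candidate responses for one prompt,
# a reference policy, and reward-model scores (all values illustrative).
pi_ref = [0.5, 0.3, 0.2]
scores = [1.0, 2.0, 0.0]

aligned = kl_regularized_rlhf_policy(pi_ref, scores, beta=1.0)
```

As β grows, the tilted policy collapses back toward π_ref, which mirrors the bandit case: the regularization strength simultaneously dials the privacy level and the deviation from the reference behavior.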