Differentially Private Kernelized Contextual Bandits

📅 2025-01-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper studies kernelized contextual bandits with stochastic contexts under joint differential privacy (JDP). The goal is to learn an unknown reward function in a known reproducing kernel Hilbert space (RKHS) while keeping the sequence of query points differentially private with respect to both the contexts and the rewards. The paper proposes a kernel regression estimator that combines high utility with low sensitivity to the observed contexts and rewards, which yields an improved O(1/ε) dependence on the privacy parameter ε. After T queries, the error rate is bounded by O(√(γ_T/T) + γ_T/(Tε)), where γ_T is the effective dimension of the kernel; this improves on the prior state of the art. The result covers a large class of kernel families, including Gaussian and Matérn kernels, and combines RKHS-based modeling with differentially private mechanism design.
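To make the two terms of the stated rate concrete, here is a minimal sketch that evaluates each term for given values of γ_T, T, and ε. The function names are hypothetical and not from the paper, and constants and logarithmic factors hidden by the O(·) notation are ignored, so the numbers only indicate which term dominates.

```python
import math
from typing import Tuple


def error_rate_terms(gamma_T: float, T: int, eps: float) -> Tuple[float, float]:
    """Return (non_private_term, privacy_term) of the stated rate.

    Constants and log factors hidden by O(.) are dropped (assumption:
    only the orders of the two terms are being compared).
    """
    if gamma_T <= 0 or T <= 0 or eps <= 0:
        raise ValueError("gamma_T, T and eps must all be positive")
    non_private = math.sqrt(gamma_T / T)  # sqrt(gamma_T / T)
    privacy = gamma_T / (T * eps)         # gamma_T / (T * eps)
    return non_private, privacy


def privacy_dominates(gamma_T: float, T: int, eps: float) -> bool:
    """True when the privacy term exceeds the non-private term.

    Algebraically this happens exactly when eps < sqrt(gamma_T / T).
    """
    non_private, privacy = error_rate_terms(gamma_T, T, eps)
    return privacy > non_private
```

With γ_T = 100 and T = 10000, the crossover sits at ε = √(γ_T/T) = 0.1: for larger ε the privacy term is the smaller of the two, and for smaller ε it becomes the larger one.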

📝 Abstract

We consider the problem of contextual kernel bandits with stochastic contexts, where the underlying reward function belongs to a known Reproducing Kernel Hilbert Space (RKHS). We study this problem under the additional constraint of joint differential privacy, where the agent needs to ensure that the sequence of query points is differentially private with respect to both the sequence of contexts and rewards. We propose a novel algorithm that improves upon the state of the art and achieves an error rate of $\mathcal{O}\left(\sqrt{\frac{\gamma_T}{T}} + \frac{\gamma_T}{T \varepsilon}\right)$ after $T$ queries for a large class of kernel families, where $\gamma_T$ represents the effective dimensionality of the kernel and $\varepsilon>0$ is the privacy parameter. Our results are based on a novel estimator for the reward function that simultaneously enjoys high utility and low sensitivity to observed rewards and contexts, which is crucial to obtaining order-optimal learning performance with improved dependence on the privacy parameter.
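The abstract does not define $\gamma_T$. In the kernel bandit literature it is commonly taken to be the maximal information gain, the maximum over sets of $T$ points of $\frac{1}{2}\log\det(I + K/\lambda)$; whether the paper uses exactly this definition is an assumption here. The sketch below evaluates that quantity for one fixed set of points under a Gaussian kernel, which gives a lower bound on the maximum. The function names, `lam`, and `lengthscale` are illustrative choices, not taken from the paper.

```python
import math
from typing import List, Sequence


def gaussian_kernel(x: Sequence[float], y: Sequence[float],
                    lengthscale: float = 1.0) -> float:
    """Gaussian (squared-exponential) kernel k(x, y)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * lengthscale ** 2))


def information_gain(points: List[Sequence[float]], lam: float = 1.0,
                     lengthscale: float = 1.0) -> float:
    """Compute 0.5 * log det(I + K / lam) for the given points.

    Assumed definition; the paper's gamma_T may be defined differently.
    """
    n = len(points)
    # A = I + K / lam is symmetric positive definite.
    A = [[gaussian_kernel(points[i], points[j], lengthscale) / lam
          + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    # Cholesky factorisation A = L L^T, so log det A = 2 * sum(log L_ii).
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    # 0.5 * log det A = sum of log of the Cholesky diagonal.
    return sum(math.log(L[i][i]) for i in range(n))
```

Two identical points add little information beyond one, while two well-separated points contribute almost independently, which is the sense in which this quantity behaves like an effective dimension.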

Problem

Research questions and friction points this paper is trying to address.

Privacy Preservation

Multi-Armed Bandit

Error Rate Control

Innovation

Methods, ideas, or system contributions that make the work stand out.

Privacy-Preserving Algorithm

Error Rate Reduction

Reward Prediction Method

🔎 Similar Papers

Identifiable latent bandits: Combining observational data and exploration for personalized healthcare