🤖 AI Summary
This work addresses the challenge of multi-user contextual bandits where users exhibit graph-structured relationships and the reward function is both nonlinear and graph-homophilous. The authors propose a unified learning framework based on joint regularization, introducing a novel regularizer that combines graph smoothness and individual roughness penalties. They establish, for the first time, its equivalence to a norm in a single multi-user reproducing kernel Hilbert space (RKHS) and explicitly construct a composite kernel that integrates the graph Laplacian with the base arm kernel. Building on this, they develop two efficient exploration algorithms, LK-GP-UCB and LK-GP-TS. Theoretical analysis yields high-probability regret bounds dependent only on the effective dimension of the multi-user kernel, eliminating dependence on the number of users or ambient dimensionality. Experiments demonstrate significant superiority over existing baselines in nonlinear settings while maintaining competitive performance in linear cases.
📝 Abstract
We study multi-user contextual bandits where users are related by a graph and their reward functions exhibit both non-linear behavior and graph homophily. We introduce a principled joint penalty for the collection of user reward functions $\{f_u\}$, combining a graph smoothness term based on RKHS distances with an individual roughness penalty. Our central contribution is proving that this penalty is equivalent to the squared norm within a single, unified \emph{multi-user RKHS}. We explicitly derive its reproducing kernel, which elegantly fuses the graph Laplacian with the base arm kernel. This unification allows us to reframe the problem as learning a single ``lifted'' function, enabling the design of principled algorithms, \texttt{LK-GP-UCB} and \texttt{LK-GP-TS}, that leverage Gaussian Process posteriors over this new kernel for exploration. We provide high-probability regret bounds that scale with an \emph{effective dimension} of the multi-user kernel, replacing dependencies on user count or ambient dimension. Empirically, our methods outperform strong linear and non-graph-aware baselines in non-linear settings and remain competitive even when the true rewards are linear. Our work delivers a unified, theoretically grounded, and practical framework that bridges Laplacian regularization with kernelized bandits for structured exploration.
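To make the construction concrete, here is a minimal sketch of the kind of composite kernel and UCB-style acquisition the abstract describes. All specifics are assumptions for illustration, not the paper's exact formulas: we assume the joint penalty $\alpha \sum_{(u,v)\in E} \|f_u - f_v\|_{\mathcal{H}}^2 + \beta \sum_u \|f_u\|_{\mathcal{H}}^2$ induces, by standard Laplacian algebra, a multi-user kernel of the separable form $K((u,x),(v,x')) = [(\alpha L + \beta I)^{-1}]_{uv}\, k(x,x')$, with $L$ the graph Laplacian and $k$ an RBF base arm kernel; the function names (`multi_user_kernel`, `gp_ucb_scores`) and hyperparameter values are hypothetical.

```python
import numpy as np

def rbf_kernel(X, Y, lengthscale=1.0):
    """Base arm kernel k(x, x'): squared-exponential (an assumed choice)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def multi_user_kernel(users_a, X_a, users_b, X_b, L, alpha=1.0, beta=1.0,
                      lengthscale=1.0):
    """Composite kernel on (user, arm) pairs: graph factor times arm factor.

    The graph factor (alpha*L + beta*I)^{-1} encodes homophily: neighboring
    users share statistical strength; beta keeps each user's own penalty.
    """
    G = np.linalg.inv(alpha * L + beta * np.eye(L.shape[0]))
    return G[np.ix_(users_a, users_b)] * rbf_kernel(X_a, X_b, lengthscale)

def gp_ucb_scores(users_hist, X_hist, y_hist, users_cand, X_cand, L,
                  noise=0.1, beta_t=2.0, **kw):
    """LK-GP-UCB-style acquisition (a sketch, not the authors' code):
    posterior mean + beta_t * posterior std under a GP with the composite
    kernel; the chosen arm would be the candidate maximizing this score."""
    K = multi_user_kernel(users_hist, X_hist, users_hist, X_hist, L, **kw)
    Ks = multi_user_kernel(users_cand, X_cand, users_hist, X_hist, L, **kw)
    Kss = multi_user_kernel(users_cand, X_cand, users_cand, X_cand, L, **kw)
    A = K + noise**2 * np.eye(len(y_hist))          # noisy Gram matrix
    mu = Ks @ np.linalg.solve(A, y_hist)            # posterior mean
    cov = Kss - Ks @ np.linalg.solve(A, Ks.T)       # posterior covariance
    sigma = np.sqrt(np.clip(np.diag(cov), 0.0, None))
    return mu + beta_t * sigma
```

A Thompson-sampling variant (\texttt{LK-GP-TS}-style) would instead draw a sample from the posterior $\mathcal{N}(\mu, \mathrm{cov})$ and pick its argmax; both strategies explore through the same composite-kernel posterior.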