🤖 AI Summary
This paper studies kernelized bandits in a reproducing kernel Hilbert space (RKHS), where the reward function has bounded RKHS norm and the action set is a compact subset of ℝᵈ. To address the exploration–exploitation trade-off, the authors propose GP-Generic, a unified algorithmic framework whose key innovation is an "exploration distribution" that subsumes both upper-confidence-bound (UCB) and randomized strategies. With a suitable choice of exploration distribution, GP-Generic achieves a cumulative regret bound of Õ(γ_T√T), where γ_T denotes the maximum information gain after T rounds. The framework is agnostic to the kernel choice and sampling mechanism, and under mild conditions it matches the regret guarantees of UCB and Thompson Sampling. Empirical results show that well-designed stochastic exploration distributions can outperform deterministic policies, combining theoretical guarantees with practical efficiency.
📝 Abstract
We consider a kernelized bandit problem with a compact arm set $\mathcal{X} \subset \mathbb{R}^d$ and a fixed but unknown reward function $f^*$ with a finite norm in some Reproducing Kernel Hilbert Space (RKHS). We propose a class of computationally efficient kernelized bandit algorithms, which we call GP-Generic, based on a novel concept: exploration distributions. This class of algorithms includes Upper Confidence Bound-based approaches as a special case, but also allows for a variety of randomized algorithms. With a careful choice of exploration distribution, our proposed generic algorithm realizes a wide range of concrete algorithms that achieve $\tilde{O}(\gamma_T \sqrt{T})$ regret bounds, where $\gamma_T$ characterizes the RKHS complexity. This matches known results for UCB- and Thompson Sampling-based algorithms; we also show that randomization can yield better empirical performance.
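To make the exploration-distribution idea concrete, here is a minimal sketch of one round of such an algorithm, assuming (hypothetically) that the acquisition score takes the form $\mu_t(x) + z_t \sigma_t(x)$ with a scalar $z_t$ drawn from the exploration distribution: a point mass at a fixed $\beta$ recovers a GP-UCB-style rule, while a random draw gives a randomized, Thompson-Sampling-flavoured rule. The kernel, candidate grid, and function names below are illustrative, not the paper's actual implementation.

```python
import numpy as np

def rbf_kernel(A, B, ls=0.5):
    """Squared-exponential kernel matrix between row sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ls ** 2))

def gp_posterior(X_obs, y_obs, X_cand, noise=0.1):
    """GP posterior mean and standard deviation at candidate points."""
    K = rbf_kernel(X_obs, X_obs) + noise ** 2 * np.eye(len(X_obs))
    k = rbf_kernel(X_obs, X_cand)            # (n_obs, n_cand)
    K_inv = np.linalg.inv(K)
    mu = k.T @ K_inv @ y_obs
    # Prior variance is 1 for this kernel; subtract the explained part.
    var = 1.0 - np.einsum("ij,jk,ki->i", k.T, K_inv, k)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def gp_generic_step(X_obs, y_obs, X_cand, sample_z, rng):
    """One round: play argmax of mu + z * sigma, z ~ exploration dist."""
    mu, sigma = gp_posterior(X_obs, y_obs, X_cand)
    z = sample_z(rng)
    return int(np.argmax(mu + z * sigma))

# Two illustrative exploration distributions:
ucb_like = lambda rng: 2.0                         # point mass -> UCB-style
ts_like = lambda rng: abs(rng.standard_normal())   # randomized exploration
```

Under this (simplified) view, swapping `sample_z` is the only change needed to move between deterministic and randomized exploration, which is the flexibility the GP-Generic framework is designed to expose.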