🤖 AI Summary
Randomized exploration in nonlinear contextual bandits lacks theoretical guarantees and practical efficacy.
Method: We propose two general ensemble sampling frameworks—GLM-ES (for generalized linear models) and Neural-ES (for neural networks)—that construct multiple estimators on perturbed data to enable efficient exploration.
Contribution/Results: We establish the first provably sound ensemble sampling theory for nonlinear settings, eliminating restrictive fixed-horizon assumptions and supporting anytime online algorithms. By integrating maximum likelihood estimation with stochastic perturbation, and by leveraging an effective-dimension analysis based on the neural tangent kernel, we ensure convergence and support high-dimensional nonlinear reward modeling. Our theoretical analysis yields state-of-the-art frequentist regret bounds: $\mathcal{O}(d^{3/2}\sqrt{T})$ for GLM-ES and $\widetilde{\mathcal{O}}(d_{\text{eff}}\sqrt{T})$ for Neural-ES. Extensive experiments validate the robustness and practicality of both frameworks.
📝 Abstract
We provide a unified algorithmic framework for ensemble sampling in nonlinear contextual bandits and develop corresponding regret bounds for the two most common nonlinear contextual bandit settings: Generalized Linear Ensemble Sampling (\texttt{GLM-ES}) for generalized linear bandits and Neural Ensemble Sampling (\texttt{Neural-ES}) for neural contextual bandits. Both methods maintain multiple estimators of the reward model parameters via maximum likelihood estimation on randomly perturbed data. We prove high-probability frequentist regret bounds of $\mathcal{O}(d^{3/2} \sqrt{T} + d^{9/2})$ for \texttt{GLM-ES} and $\mathcal{O}(\widetilde{d} \sqrt{T})$ for \texttt{Neural-ES}, where $d$ is the dimension of the feature vectors, $\widetilde{d}$ is the effective dimension of a neural tangent kernel matrix, and $T$ is the number of rounds. These regret bounds match the state-of-the-art results for randomized exploration algorithms in nonlinear contextual bandit settings. In the theoretical analysis, we introduce techniques that address challenges specific to nonlinear models. On the practical side, we remove the fixed-time-horizon assumption by developing anytime versions of our algorithms, suitable when $T$ is unknown. Finally, we empirically evaluate \texttt{GLM-ES}, \texttt{Neural-ES}, and their anytime variants, demonstrating strong performance. Overall, our results establish ensemble sampling as a provable and practical randomized exploration approach for nonlinear contextual bandits.
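To make the ensemble-sampling idea concrete, here is a minimal Python sketch of one round in the GLM setting with a logistic link: each ensemble member is a regularized maximum likelihood estimate fit on an independently perturbed copy of the reward history, and the learner samples one member uniformly and acts greedily under it. The solver (plain gradient descent), the Gaussian perturbation scale, and all function names are illustrative assumptions, not the paper's exact algorithm; a real implementation would also maintain each member's perturbed dataset incrementally across rounds rather than refitting from scratch.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_glm(X, y, lam=1.0, steps=200, lr=0.5):
    """Regularized MLE for a logistic-link GLM via gradient descent (sketch)."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        # Gradient of the negative log-likelihood plus L2 regularization.
        grad = X.T @ (sigmoid(X @ theta) - y) + lam * theta
        theta -= lr * grad / len(y)
    return theta

def ensemble_sampling_round(history_X, history_y, contexts, m=10, noise_sd=0.5):
    """One round of GLM-style ensemble sampling (illustrative).

    Fits m estimators, each on rewards perturbed with fresh Gaussian noise,
    samples one member uniformly, and plays the greedy arm under it.
    """
    estimators = []
    for _ in range(m):
        y_pert = history_y + rng.normal(0.0, noise_sd, size=len(history_y))
        estimators.append(fit_glm(history_X, y_pert))
    theta = estimators[rng.integers(m)]      # sample an ensemble member
    return int(np.argmax(contexts @ theta))  # act greedily under that member
```

The random perturbations play the role that posterior sampling plays in Thompson sampling: disagreement among ensemble members injects the exploration, so no explicit confidence bonus is needed.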