🤖 AI Summary
Randomized exploration in nonlinear contextual bandits lacks theoretical guarantees and practical efficacy.
Method: We propose two general ensemble sampling frameworks—GLM-ES (for generalized linear models) and Neural-ES (for neural networks)—that construct multiple estimators on perturbed data to enable efficient exploration.
Contribution/Results: We establish the first provably sound ensemble sampling theory for nonlinear settings, eliminating restrictive fixed-horizon assumptions and supporting anytime online algorithms. By integrating maximum likelihood estimation with stochastic perturbation, and by leveraging an effective-dimension analysis based on the neural tangent kernel, we ensure convergence and support high-dimensional nonlinear reward modeling. Our theoretical analysis yields state-of-the-art frequentist regret bounds: $\mathcal{O}(d^{3/2}\sqrt{T})$ for GLM-ES and $\widetilde{\mathcal{O}}(d_{\text{eff}}\sqrt{T})$ for Neural-ES. Extensive experiments validate the robustness and practicality of both frameworks.
📝 Abstract
We provide a unified algorithmic framework for ensemble sampling in nonlinear contextual bandits and develop corresponding regret bounds for the two most common nonlinear contextual bandit settings: Generalized Linear Ensemble Sampling (\texttt{GLM-ES}) for generalized linear bandits and Neural Ensemble Sampling (\texttt{Neural-ES}) for neural contextual bandits. Both methods maintain multiple estimators of the reward model parameters via maximum likelihood estimation on randomly perturbed data. We prove high-probability frequentist regret bounds of $\mathcal{O}(d^{3/2} \sqrt{T} + d^{9/2})$ for \texttt{GLM-ES} and $\mathcal{O}(\widetilde{d} \sqrt{T})$ for \texttt{Neural-ES}, where $d$ is the dimension of the feature vectors, $\widetilde{d}$ is the effective dimension of a neural tangent kernel matrix, and $T$ is the number of rounds. These regret bounds match the state-of-the-art results for randomized exploration algorithms in nonlinear contextual bandit settings. In the theoretical analysis, we introduce techniques that address challenges specific to nonlinear models. On the practical side, we remove the fixed-time-horizon assumption by developing anytime versions of our algorithms, suitable when $T$ is unknown. Finally, we empirically evaluate \texttt{GLM-ES}, \texttt{Neural-ES}, and their anytime variants, demonstrating strong performance. Overall, our results establish ensemble sampling as a provable and practical randomized exploration approach for nonlinear contextual bandits.
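To make the ensemble-sampling idea concrete, here is a minimal Python sketch of one round in the GLM setting with a logistic link: each ensemble member is a regularized maximum likelihood estimate fit on an independently perturbed copy of the reward history, and the learner samples one member uniformly and acts greedily under it. The solver (plain gradient descent), the Gaussian perturbation scale, and all function names are illustrative assumptions, not the paper's exact algorithm; a real implementation would also maintain each member's perturbed dataset incrementally across rounds rather than refitting from scratch.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_glm(X, y, lam=1.0, steps=200, lr=0.5):
    """Regularized MLE for a logistic-link GLM via gradient descent (sketch)."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        # Gradient of the negative log-likelihood plus L2 regularization.
        grad = X.T @ (sigmoid(X @ theta) - y) + lam * theta
        theta -= lr * grad / len(y)
    return theta

def ensemble_sampling_round(history_X, history_y, contexts, m=10, noise_sd=0.5):
    """One round of GLM-style ensemble sampling (illustrative).

    Fits m estimators, each on rewards perturbed with fresh Gaussian noise,
    samples one member uniformly, and plays the greedy arm under it.
    """
    estimators = []
    for _ in range(m):
        y_pert = history_y + rng.normal(0.0, noise_sd, size=len(history_y))
        estimators.append(fit_glm(history_X, y_pert))
    theta = estimators[rng.integers(m)]      # sample an ensemble member
    return int(np.argmax(contexts @ theta))  # act greedily under that member
```

The random perturbations play the role that posterior sampling plays in Thompson sampling: disagreement among ensemble members injects the exploration, so no explicit confidence bonus is needed.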