🤖 AI Summary
In stochastic linear bandits with possibly infinite action sets, existing ensemble-sampling methods incur high computational overhead because the ensemble size must scale linearly with the time horizon $T$.
Method: This paper proposes a novel ensemble sampling framework that integrates Bayesian posterior approximation, linear function approximation, and concentration-inequality analysis.
Contribution/Results: Unlike conventional approaches requiring $O(T)$ base learners, our method is the first to achieve a lightweight ensemble of only $O(d \log T)$ learners in structured bandits—breaking the linear dependence on $T$. We establish a regret upper bound of $\tilde{O}((d \log T)^{5/2} \sqrt{T})$ in $d$-dimensional linear environments over horizon $T$, approaching the optimal $\tilde{O}(\sqrt{T})$ benchmark. The framework naturally accommodates infinite action spaces and significantly improves computational efficiency, offering a new paradigm for scalable, high-dimensional, long-horizon online decision-making under large or infinite action sets.
📝 Abstract
We provide the first useful and rigorous analysis of ensemble sampling for the stochastic linear bandit setting. In particular, we show that, under standard assumptions, for a $d$-dimensional stochastic linear bandit with an interaction horizon $T$, ensemble sampling with an ensemble of size of order $d \log T$ incurs regret at most of the order $(d \log T)^{5/2} \sqrt{T}$. Ours is the first result in any structured setting not to require the size of the ensemble to scale linearly with $T$ -- which defeats the purpose of ensemble sampling -- while obtaining near $\smash{\sqrt{T}}$ order regret. Our result is also the first to allow for infinite action sets.
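To make the setting concrete, here is a minimal sketch of ensemble sampling for a linear bandit: maintain a small ensemble of regularized least-squares estimates, each trained on independently perturbed copies of the observed rewards, and at each round act greedily with respect to a uniformly sampled ensemble member. This is an illustrative toy, not the paper's exact algorithm or analysis; the finite action set, noise scales, and constant ensemble size are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

d, m, T = 3, 8, 500           # dimension, ensemble size, horizon (illustrative)
lam, sigma = 1.0, 0.1         # ridge regularization, reward-noise scale
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)
actions = rng.normal(size=(20, d))            # finite action set for illustration
actions /= np.linalg.norm(actions, axis=1, keepdims=True)

V = lam * np.eye(d)           # shared regularized Gram matrix
b = np.zeros((m, d))          # one perturbed regression target per model

for t in range(T):
    i = rng.integers(m)                          # sample an ensemble member uniformly
    theta_i = np.linalg.solve(V, b[i])           # its ridge-regression estimate
    x = actions[np.argmax(actions @ theta_i)]    # act greedily for that member
    r = x @ theta_star + sigma * rng.normal()    # observe a noisy linear reward
    V += np.outer(x, x)
    # each model regresses on an independently perturbed copy of the reward,
    # which keeps the ensemble members diverse (the source of exploration)
    b += x * (r + sigma * rng.normal(size=(m, 1)))
```

The per-round cost scales with the ensemble size $m$, which is why the paper's $O(d \log T)$ bound on $m$ (rather than $O(T)$) is what makes the method computationally lightweight.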