🤖 AI Summary
This paper studies the linear stochastic multi-armed bandit problem with ellipsoidal action sets. It establishes, for the first time, a tight information-theoretic lower bound of Ω(min{dσ√T + d‖θ‖_A, ‖θ‖_A T}) on the regret of any algorithm, and proposes the first minimax-optimal, and in fact locally asymptotically minimax-optimal, algorithm for this setting. Departing from conventional optimism- and sampling-based paradigms, the method follows a "sequential estimation + explore-then-commit" framework: a novel sequential procedure first estimates ‖θ‖, and an explore-then-commit strategy informed by this estimate then exploits the computationally tractable geometry of the ellipsoid (here ‖θ‖_A denotes the norm induced by the matrix A that defines the action set). The algorithm's regret matches the lower bound up to a universal multiplicative constant, and it is highly efficient, running in O(dT + d² log(T/d) + d³) time and O(d²) memory. Numerical experiments confirm both the optimal regret behavior and strong empirical performance.
📝 Abstract
We consider linear stochastic bandits where the set of actions is an ellipsoid. We provide the first known minimax optimal algorithm for this problem. We first derive a novel information-theoretic lower bound on the regret of any algorithm, which must be at least $\Omega(\min(d \sigma \sqrt{T} + d \|\theta\|_{A}, \|\theta\|_{A} T))$ where $d$ is the dimension, $T$ the time horizon, $\sigma^2$ the noise variance, $A$ a matrix defining the set of actions and $\theta$ the vector of unknown parameters. We then provide an algorithm whose regret matches this bound up to a multiplicative universal constant. The algorithm is non-classical in the sense that it is not optimistic, and it is not a sampling algorithm. The main idea is to combine a novel sequential procedure to estimate $\|\theta\|$, followed by an explore-and-commit strategy informed by this estimate. The algorithm is highly computationally efficient, and a run requires only time $O(dT + d^2 \log(T/d) + d^3)$ and memory $O(d^2)$, in contrast with known optimistic algorithms, which are not implementable in polynomial time. We go beyond minimax optimality and show that our algorithm is locally asymptotically minimax optimal, a much stronger notion of optimality. We further provide numerical experiments to illustrate our theoretical findings.
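To make the explore-and-commit idea concrete, the sketch below simulates a simplified variant on the ellipsoid $\{x : x^\top A^{-1} x \le 1\}$ with rewards $x^\top \theta + \text{noise}$. It explores along the ellipsoid's principal axes, estimates $\theta$ by least squares, and commits to the greedy action $A\hat\theta / \|\hat\theta\|_A$ (the maximizer of $x^\top \hat\theta$ over the ellipsoid). Note this is a hedged illustration, not the paper's algorithm: the fixed per-dimension exploration budget `n_explore_per_dim` replaces the paper's sequential estimation of $\|\theta\|$, and all names here are hypothetical.

```python
import numpy as np

def etc_ellipsoid_bandit(A, theta, T, sigma=0.1, n_explore_per_dim=10, rng=None):
    """Simplified explore-then-commit on the ellipsoid {x : x^T A^-1 x <= 1}.

    Illustration only: the exploration length is a fixed budget, whereas the
    paper's algorithm sets it adaptively via a sequential estimate of ||theta||.
    Returns the least-squares estimate of theta and the committed action.
    """
    rng = np.random.default_rng(rng)
    d = A.shape[0]
    # Exploration actions: the scaled eigenvectors sqrt(lambda_i) v_i, which
    # lie exactly on the boundary of the ellipsoid and span R^d.
    eigvals, eigvecs = np.linalg.eigh(A)
    axes = eigvecs * np.sqrt(eigvals)  # column i is one exploration direction
    X, y = [], []
    for _ in range(n_explore_per_dim):
        for i in range(d):
            if len(X) >= T:
                break
            x = axes[:, i]
            X.append(x)
            y.append(x @ theta + sigma * rng.standard_normal())  # noisy reward
    X, y = np.array(X), np.array(y)
    # Estimate theta from the exploration rounds by ordinary least squares.
    theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    # Commit: argmax over the ellipsoid of x^T theta_hat is A theta_hat / ||theta_hat||_A.
    norm_A = np.sqrt(theta_hat @ A @ theta_hat)
    x_commit = A @ theta_hat / norm_A if norm_A > 0 else axes[:, 0]
    return theta_hat, x_commit
```

The commit step uses the closed-form maximizer available for ellipsoids, whose optimal value is exactly $\|\theta\|_A$; this tractable geometry is what lets the overall method avoid the intractable optimization faced by optimistic algorithms.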