🤖 AI Summary
This work addresses statistical inference in multi-armed bandits, where adaptive sampling violates the independence assumptions underlying classical inferential guarantees. Working within the stochastic mirror descent framework, the authors establish a general stability criterion and propose a family of regularized-EXP3 algorithms with a log-barrier regularizer that satisfies it. As a consequence, the estimators are asymptotically normal and Wald-type confidence intervals achieve nominal coverage, while the algorithms attain minimax-optimal regret up to logarithmic factors. A modified variant also preserves asymptotic normality of the empirical arm means under $o(T^{1/2})$ adversarial corruptions. Together, these results show that efficient online learning and valid statistical inference are compatible within the mirror descent framework.
📝 Abstract
Statistical inference with bandit data presents fundamental challenges due to adaptive sampling, which violates the independence assumptions underlying classical asymptotic theory. Recent work has identified stability as a sufficient condition for valid inference under adaptivity. This paper develops a systematic theory of stability for bandit algorithms based on stochastic mirror descent, a broad algorithmic framework that includes the widely used EXP3 algorithm as a special case. Our contributions are threefold. First, we establish a general stability criterion: if the average iterates of a stochastic mirror descent algorithm converge in ratio to a non-random probability vector, then the induced bandit algorithm is stable. This result provides a unified lens for analyzing stability across diverse algorithmic instantiations. Second, we introduce a family of regularized-EXP3 algorithms employing a log-barrier regularizer with appropriately tuned parameters. We prove that these algorithms satisfy our stability criterion and, as an immediate corollary, that Wald-type confidence intervals for linear functionals of the mean parameter achieve nominal coverage. Notably, we show that the same algorithms attain minimax-optimal regret guarantees up to logarithmic factors, demonstrating that inference-enabling stability and learning efficiency are compatible objectives within the mirror descent framework. Third, we establish robustness to corruption: a modified variant of regularized-EXP3 maintains asymptotic normality of empirical arm means even in the presence of $o(T^{1/2})$ adversarial corruptions. This stands in sharp contrast to other stable algorithms such as UCB, which suffer linear regret even under logarithmic levels of corruption.
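
To make the Wald-type interval concrete, the following Python sketch runs plain EXP3 on Bernoulli arms and forms an interval for a linear functional $c^\top \mu$ of the arm means. It is not the paper's log-barrier regularized variant, whose regularizer and tuning are not specified in this abstract. The function names, the step size `eta`, the Bernoulli reward model, and the plug-in variance are all assumptions made for illustration. Coverage of such an interval is only justified when the sampling algorithm is stable, which is what the abstract claims for regularized-EXP3 and not for this baseline.

```python
import math
import random


def exp3_run(means, T, eta, seed=0):
    """Plain EXP3 on Bernoulli arms (loss = 1 - reward).

    Illustrative baseline only; returns per-arm pull counts and reward sums.
    """
    rng = random.Random(seed)
    K = len(means)
    cum_loss = [0.0] * K  # importance-weighted cumulative loss estimates
    counts = [0] * K
    sums = [0.0] * K
    for _ in range(T):
        # Exponential weights; shift by the minimum for numerical stability.
        shift = min(cum_loss)
        weights = [math.exp(-eta * (loss - shift)) for loss in cum_loss]
        total = sum(weights)
        probs = [w / total for w in weights]
        arm = rng.choices(range(K), weights=probs)[0]
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        # Only the pulled arm's estimate is updated, scaled by 1 / probs[arm].
        cum_loss[arm] += (1.0 - reward) / probs[arm]
    return counts, sums


def wald_interval(counts, sums, c, z=1.96):
    """Wald-type interval for sum_a c[a] * mu[a].

    Assumes Bernoulli rewards, so the plug-in variance is mu_hat * (1 - mu_hat).
    """
    estimate = 0.0
    variance = 0.0
    for n, s, coef in zip(counts, sums, c):
        if coef == 0:
            continue
        if n == 0:
            raise ValueError("an arm with a nonzero coefficient was never pulled")
        mu_hat = s / n
        estimate += coef * mu_hat
        variance += coef ** 2 * mu_hat * (1.0 - mu_hat) / n
    half_width = z * math.sqrt(variance)
    return estimate - half_width, estimate + half_width


if __name__ == "__main__":
    counts, sums = exp3_run([0.3, 0.7], T=2000, eta=0.01, seed=1)
    low, high = wald_interval(counts, sums, c=[1.0, -1.0])
    print("pull counts:", counts)
    print("interval for mu_1 - mu_2:", (low, high))
```

Here `c=[1.0, -1.0]` targets the gap between the two arm means. Because EXP3 concentrates its pulls on the better arm, the interval width is driven mostly by the rarely pulled arm.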