Stability and Robustness via Regularization: Bandit Inference via Regularized Stochastic Mirror Descent

📅 2026-03-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses statistical inference in multi-armed bandits, where adaptive sampling violates the independence assumptions that classical inferential guarantees rely on. The authors propose a regularized EXP3 algorithm within the stochastic mirror descent framework, using a log-barrier regularizer, and establish a general stability criterion for such algorithms. The approach yields asymptotically normal estimators and confidence intervals with nominal coverage, while achieving minimax-optimal regret up to logarithmic factors. A modified variant retains asymptotic normality of the empirical arm means under $o(T^{1/2})$ adversarial corruptions. This is reportedly the first method to unify efficient online learning with valid statistical inference in such settings.
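
For orientation, the stability condition and the Wald-type interval mentioned in the summary are commonly written as below in the bandit-inference literature. This is the standard formulation, assumed here rather than quoted from the paper. The symbols $N_a(T)$ (number of pulls of arm $a$ up to round $T$), $N_a^\star(T)$ (a non-random sequence), $\hat\mu_a(T)$ and $\hat\sigma_a(T)$ (sample mean and standard deviation of arm $a$) are notation introduced for illustration. An algorithm is called stable if, for every arm $a$,

$$\frac{N_a(T)}{N_a^\star(T)} \xrightarrow{\;p\;} 1 \quad \text{with} \quad N_a^\star(T) \to \infty,$$

and the level-$(1-\alpha)$ Wald interval for the mean of arm $a$ is

$$\hat\mu_a(T) \pm z_{1-\alpha/2}\,\frac{\hat\sigma_a(T)}{\sqrt{N_a(T)}},$$

where $z_{1-\alpha/2}$ is the standard normal quantile.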

📝 Abstract

Statistical inference with bandit data presents fundamental challenges due to adaptive sampling, which violates the independence assumptions underlying classical asymptotic theory. Recent work has identified stability as a sufficient condition for valid inference under adaptivity. This paper develops a systematic theory of stability for bandit algorithms based on stochastic mirror descent, a broad algorithmic framework that includes the widely-used EXP3 algorithm as a special case. Our contributions are threefold. First, we establish a general stability criterion: if the average iterates of a stochastic mirror descent algorithm converge in ratio to a non-random probability vector, then the induced bandit algorithm is stable. This result provides a unified lens for analyzing stability across diverse algorithmic instantiations. Second, we introduce a family of regularized-EXP3 algorithms employing a log-barrier regularizer with appropriately tuned parameters. We prove that these algorithms satisfy our stability criterion and, as an immediate corollary, that Wald-type confidence intervals for linear functionals of the mean parameter achieve nominal coverage. Notably, we show that the same algorithms attain minimax-optimal regret guarantees up to logarithmic factors, demonstrating that inference-enabling stability and learning efficiency are compatible objectives within the mirror descent framework. Third, we establish robustness to corruption: a modified variant of regularized-EXP3 maintains asymptotic normality of empirical arm means even in the presence of $o(T^{1/2})$ adversarial corruptions. This stands in sharp contrast to other stable algorithms such as UCB, which suffer linear regret even under logarithmic levels of corruption.
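
To make the setting concrete, here is a minimal Python sketch of plain EXP3 on Bernoulli arms, followed by a Wald interval computed from the adaptively collected data. It is not the paper's regularized-EXP3: the log-barrier regularizer and its tuning are not specified in the abstract, so uniform mixing with a hypothetical parameter `gamma` stands in as a simple way to keep sampling probabilities away from zero. The function names, the Bernoulli reward model, and all parameter values are assumptions made for illustration, and the coverage guarantee described above is claimed for the paper's algorithms, not for this sketch.

```python
import math
import random
from statistics import NormalDist


def exp3_probs(loss_est, eta, gamma):
    """Sampling distribution: softmax of -eta * estimated losses, mixed with uniform.

    The uniform mixing (gamma) is an illustrative stand-in, not the paper's
    log-barrier regularizer.
    """
    k = len(loss_est)
    base = min(loss_est)  # shift so every exponent is <= 0 (avoids overflow)
    weights = [math.exp(-eta * (l - base)) for l in loss_est]
    total = sum(weights)
    return [(1.0 - gamma) * w / total + gamma / k for w in weights]


def run_exp3(means, horizon, eta, gamma, seed=0):
    """Run EXP3 on Bernoulli arms; return per-arm pull counts and reward sums."""
    rng = random.Random(seed)
    k = len(means)
    loss_est = [0.0] * k   # importance-weighted cumulative loss estimates
    counts = [0] * k
    sums = [0.0] * k
    for _ in range(horizon):
        p = exp3_probs(loss_est, eta, gamma)
        arm = rng.choices(range(k), weights=p)[0]
        reward = 1.0 if rng.random() < means[arm] else 0.0
        # Only the pulled arm is updated, scaled by 1 / p[arm] for unbiasedness.
        loss_est[arm] += (1.0 - reward) / p[arm]
        counts[arm] += 1
        sums[arm] += reward
    return counts, sums


def wald_interval(n, reward_sum, alpha=0.05):
    """Wald interval for a Bernoulli mean from n pulls with the given reward sum."""
    if n < 1:
        raise ValueError("need at least one observation")
    mean = reward_sum / n
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z * math.sqrt(mean * (1.0 - mean) / n)
    return mean - half_width, mean + half_width
```

For example, `run_exp3([0.4, 0.6], horizon=2000, eta=0.02, gamma=0.1)` returns the pull counts and reward sums for two arms, and `wald_interval(counts[a], sums[a])` then gives an interval for arm `a`. Whether such an interval actually attains nominal coverage depends on the stability of the sampling rule, which is the question the paper studies.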

Problem

Research questions and friction points this paper is trying to address.

bandit inference

adaptive sampling

stability

statistical inference

adversarial corruption

Innovation

Methods, ideas, or system contributions that make the work stand out.

stability

regularized stochastic mirror descent

bandit inference