Optimism Stabilizes Thompson Sampling for Adaptive Inference

📅 2026-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge that standard Thompson sampling under adaptive data collection couples arm selection with observed rewards, producing random pull counts that violate the stability conditions required for valid asymptotic inference. To resolve this, the authors propose two optimistic variants of Thompson sampling (variance-inflated and mean-bonus) that use optimism to concentrate each arm's pull count around a deterministic scale, thereby restoring stability. The paper extends stability theory from the two-armed to the general K-armed Gaussian bandit setting, establishing for the first time that the proposed methods enable asymptotically valid inference while incurring only a modest increase in regret. This resolves an open problem concerning the compatibility of Thompson sampling with reliable post-experiment statistical inference.

📝 Abstract
Thompson sampling (TS) is widely used for stochastic multi-armed bandits, yet its inferential properties under adaptive data collection are subtle. Classical asymptotic theory for sample means can fail because arm-specific sample sizes are random and coupled with the rewards through the action-selection rule. We study this phenomenon in the $K$-armed Gaussian bandit and identify \emph{optimism} as a key mechanism for restoring \emph{stability}, a sufficient condition for valid asymptotic inference requiring each arm's pull count to concentrate around a deterministic scale. First, we prove that variance-inflated TS \citep{halder2025stable} is stable for any $K \ge 2$, including the challenging regime where multiple arms are optimal. This resolves the open question raised by \citet{halder2025stable} by extending their results from the two-armed setting to the general $K$-armed setting. Second, we analyze an alternative optimistic modification that keeps the posterior variance unchanged but adds an explicit mean bonus to the posterior mean, and establish the same stability conclusion. In summary, suitably implemented optimism stabilizes Thompson sampling and enables asymptotically valid inference in multi-armed bandits, while incurring only a mild additional regret cost.
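The two optimistic modifications described in the abstract can be illustrated with a minimal sketch of $K$-armed Gaussian Thompson sampling. This is an assumption-laden toy (unit-variance rewards, a flat prior, one forced initial pull per arm), and the `inflate` factor and `bonus_scale` schedule below are illustrative placeholders, not the paper's exact constructions:

```python
import numpy as np

rng = np.random.default_rng(0)

def run_ts(means, horizon, variant="standard", inflate=2.0, bonus_scale=1.0):
    """One run of Gaussian Thompson sampling with optional optimistic tweaks.

    variant: "standard", "inflated" (posterior variance multiplied by
    `inflate`), or "bonus" (an exploration bonus added to the posterior
    mean, leaving the posterior variance unchanged).
    """
    K = len(means)
    counts = np.zeros(K)
    sums = np.zeros(K)
    for t in range(horizon):
        if t < K:
            arm = t  # pull each arm once so posteriors are well-defined
        else:
            # Posterior under unit-variance rewards and a flat prior:
            # N(mean_hat, 1 / n) for each arm.
            mean_hat = sums / counts
            var = 1.0 / counts
            if variant == "inflated":
                var = inflate * var  # variance-inflated TS
            samples = rng.normal(mean_hat, np.sqrt(var))
            if variant == "bonus":
                samples += bonus_scale * np.sqrt(1.0 / counts)  # mean bonus
            arm = int(np.argmax(samples))
        reward = rng.normal(means[arm], 1.0)
        counts[arm] += 1
        sums[arm] += reward
    return counts

counts = run_ts(np.array([0.0, 0.0, 0.5]), horizon=2000, variant="inflated")
```

The quantity the stability theory cares about is `counts`: under the optimistic variants, each arm's pull count should concentrate around a deterministic scale, which is what licenses standard sample-mean confidence intervals after the experiment.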
Problem

Research questions and friction points this paper is trying to address.

Thompson sampling
adaptive inference
multi-armed bandits
asymptotic inference
stability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Thompson sampling
optimism
stability
adaptive inference
multi-armed bandits
Shunxing Yan
Peking University
Han Zhong
Peking University
Machine Learning