🤖 AI Summary
This work addresses the long-standing challenge of characterizing the Bayesian regret of Thompson sampling in linear Gaussian bandits, where prior specification and long-term performance are intricately coupled. We establish that the regret decomposes additively into a “warm-up” term—dependent on the prior covariance $\Sigma_0$—and a minimax-optimal long-term term of order $\tilde{O}(\sigma d \sqrt{T})$, in contrast to the multiplicative dependence exhibited by prior bounds. To this end, we introduce a novel elliptical potential lemma and prove a lower bound showing that the warm-up term $d r \sqrt{\mathrm{Tr}(\Sigma_0)}$ is unavoidable. Consequently, we obtain a total regret upper bound of $\tilde{O}(\sigma d \sqrt{T} + d r \sqrt{\mathrm{Tr}(\Sigma_0)})$, which improves upon existing results.
📝 Abstract
We prove that Thompson sampling exhibits $\tilde{O}(\sigma d \sqrt{T} + d r \sqrt{\mathrm{Tr}(\Sigma_0)})$ Bayesian regret in the linear-Gaussian bandit with a $\mathcal{N}(\mu_0, \Sigma_0)$ prior distribution on the coefficients, where $d$ is the dimension, $T$ is the time horizon, $r$ is the maximum $\ell_2$ norm of the actions, and $\sigma^2$ is the noise variance. In contrast to existing regret bounds, this shows that to within logarithmic factors, the prior-dependent ``burn-in'' term $d r \sqrt{\mathrm{Tr}(\Sigma_0)}$ decouples additively from the minimax (long-run) regret $\sigma d \sqrt{T}$. Previous regret bounds exhibit a multiplicative dependence on these terms. We establish these results via a new ``elliptical potential'' lemma, and also provide a lower bound indicating that the burn-in term is unavoidable.
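To make the setting concrete, here is a minimal sketch of Thompson sampling in the linear-Gaussian bandit described above: the learner holds a $\mathcal{N}(\mu_0, \Sigma_0)$ posterior over the coefficient vector, samples from it each round, acts greedily on the sample, and performs the conjugate Gaussian update. The problem sizes, the finite action set, and all numerical values are illustrative assumptions, not parameters from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative problem setup (assumed values, not from the paper).
d, T, sigma, r = 3, 500, 0.5, 1.0
mu0 = np.zeros(d)
Sigma0 = np.eye(d)                               # prior N(mu0, Sigma0)
theta = rng.multivariate_normal(mu0, Sigma0)     # true coefficients drawn from the prior

# Finite action set with ||x||_2 <= r (the paper's norm constraint).
actions = rng.normal(size=(20, d))
actions *= r / np.linalg.norm(actions, axis=1, keepdims=True)
best_reward = (actions @ theta).max()

# Track the posterior in precision form: prec = Sigma_t^{-1}, b = prec @ mu_t.
prec = np.linalg.inv(Sigma0)
b = prec @ mu0

regret = 0.0
for t in range(T):
    # Sample a coefficient vector from the current posterior.
    Sigma_t = np.linalg.inv(prec)
    mu_t = Sigma_t @ b
    theta_t = rng.multivariate_normal(mu_t, Sigma_t)
    # Act greedily with respect to the sampled coefficients.
    x = actions[np.argmax(actions @ theta_t)]
    # Observe a noisy linear reward with noise variance sigma^2.
    y = x @ theta + sigma * rng.normal()
    # Conjugate Gaussian posterior update.
    prec += np.outer(x, x) / sigma**2
    b += x * y / sigma**2
    regret += best_reward - x @ theta

print(regret)
```

The per-round gap `best_reward - x @ theta` accumulates into the (frequentist realization of the) regret; averaging over many draws of `theta` from the prior would estimate the Bayesian regret that the bound $\tilde{O}(\sigma d \sqrt{T} + d r \sqrt{\mathrm{Tr}(\Sigma_0)})$ controls.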