🤖 AI Summary
This paper investigates the conditions and mechanisms under which randomized exploration—specifically Thompson sampling—achieves optimal regret bounds in linear bandits. Focusing on $d$-dimensional linear environments with a smooth, strongly convex, compact action set, the authors establish, for the first time, that Thompson sampling attains the tight regret upper bound $\mathcal{O}(d\sqrt{n}\log n)$ without requiring forced optimism or posterior inflation. The analysis integrates Bayesian randomized reasoning, a high-dimensional geometric characterization of the action space, and refined statistical inference techniques, revealing that the curvature of the action set positively regulates exploration efficiency. This result not only confirms the optimal dimension scaling of randomized strategies in structured linear bandits but also breaks the theoretical reliance on deterministic or optimism-based mechanisms. It provides a new paradigm for understanding the intrinsic efficacy of Bayesian exploration in sequential decision-making under uncertainty.
📝 Abstract
We provide an approach for the analysis of randomised exploration algorithms like Thompson sampling that does not rely on forced optimism or posterior inflation. With this, we demonstrate that in the $d$-dimensional linear bandit setting, when the action space is smooth and strongly convex, randomised exploration algorithms enjoy an $n$-step regret bound of the order $O(d\sqrt{n}\log(n))$. Notably, this shows for the first time that there exist non-trivial linear bandit settings where Thompson sampling can achieve optimal dimension dependence in the regret.
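To make the setting concrete, here is a minimal sketch of Thompson sampling on a linear bandit whose action set is the Euclidean unit ball, a standard example of a smooth, strongly convex, compact set. This is an illustrative simulation, not the paper's algorithm or analysis; the Gaussian posterior, unit regularisation, and noise level are assumptions chosen for the sketch. On the unit ball, the action maximising $\langle \tilde\theta, a\rangle$ is simply $\tilde\theta / \|\tilde\theta\|$.

```python
import numpy as np

def thompson_sampling_unit_ball(theta_star, n=500, noise_sd=0.1, seed=0):
    """Thompson sampling for a linear bandit on the unit ball.

    Posterior: Bayesian linear regression with unit-variance Gaussian
    prior and noise (an assumption of this sketch). Returns cumulative regret.
    """
    rng = np.random.default_rng(seed)
    d = theta_star.shape[0]
    V = np.eye(d)        # regularised design (Gram) matrix
    b = np.zeros(d)      # accumulated reward-weighted actions
    opt = np.linalg.norm(theta_star)  # best achievable mean reward on the ball
    regret = 0.0
    for _ in range(n):
        V_inv = np.linalg.inv(V)
        theta_hat = V_inv @ b                       # posterior mean
        theta_tilde = rng.multivariate_normal(theta_hat, V_inv)  # posterior sample
        norm = np.linalg.norm(theta_tilde)
        # Optimal action for the sampled parameter on the unit ball.
        a = theta_tilde / norm if norm > 1e-12 else np.eye(d)[0]
        reward = a @ theta_star + noise_sd * rng.standard_normal()
        regret += opt - a @ theta_star
        V += np.outer(a, a)
        b += reward * a
    return regret
```

The key point the paper's result speaks to is visible even in this toy run: on a strongly convex action set, small errors in the sampled direction translate into only quadratically small regret, so the sampled action $\tilde\theta/\|\tilde\theta\|$ explores efficiently without any optimism boost.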