🤖 AI Summary
This paper studies the Bayesian regret of Thompson sampling for logistic bandits, where binary rewards follow a logistic link function and the optimal action depends on a high-dimensional parameter. Methodologically, it introduces a novel upper bound on the information ratio that decouples the dependence on the slope parameter $\beta$ and reveals the critical role of the alignment $\alpha$ between the action and parameter spaces. The analysis leverages an information-theoretic framework to derive a Bayesian expected regret bound of $O\big((d/\alpha)\sqrt{T \log(\beta T / d)}\big)$. Crucially, this bound depends only logarithmically on $\beta$, improving upon prior linear or exponential dependencies, and avoids explicit dependence on the number of actions. When the action space contains the parameter space, the bound simplifies to $\tilde{O}(d\sqrt{T})$, substantially improving over existing results. To the paper's knowledge, this is the first regret guarantee for Thompson sampling in logistic bandits that is only logarithmic in $\beta$ and independent of the number of actions.
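The route from an information-ratio bound to a regret bound follows the standard argument of Russo and Van Roy (2016); the sketch below assumes (as the stated rates suggest, though the paper's exact accounting may differ) that the entropy-like information term scales as $d \log(\beta T / d)$:

$$
\mathbb{E}[\mathrm{Regret}(T)] \;\le\; \sqrt{\bar{\Gamma} \cdot H \cdot T},
\qquad
\bar{\Gamma} \;\le\; \tfrac{9}{2}\, d\, \alpha^{-2},
\quad
H \;\approx\; d \log\!\big(\beta T / d\big),
$$

which combine to give

$$
\mathbb{E}[\mathrm{Regret}(T)]
\;=\; O\!\Big(\tfrac{d}{\alpha}\,\sqrt{T \log(\beta T / d)}\Big).
$$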
📝 Abstract
We study the performance of the Thompson Sampling algorithm for logistic bandit problems. In this setting, an agent receives binary rewards with probabilities determined by a logistic function, $\exp(\beta \langle a, \theta \rangle)/(1+\exp(\beta \langle a, \theta \rangle))$, with slope parameter $\beta > 0$, and where both the action $a \in \mathcal{A}$ and the parameter $\theta \in \mathcal{O}$ lie within the $d$-dimensional unit ball. Adopting the information-theoretic framework introduced by Russo and Van Roy (2016), we analyze the information ratio, a statistic that quantifies the trade-off between the immediate regret incurred and the information gained about the optimal action. We improve upon previous results by establishing that the information ratio is bounded by $\tfrac{9}{2} d \alpha^{-2}$, where $\alpha$ is a minimax measure of the alignment between the action space $\mathcal{A}$ and the parameter space $\mathcal{O}$, and is independent of $\beta$. Using this result, we derive a bound of order $O\big((d/\alpha)\sqrt{T \log(\beta T/d)}\big)$ on the Bayesian expected regret of Thompson Sampling incurred after $T$ time steps. To our knowledge, this is the first regret bound for logistic bandits that depends only logarithmically on $\beta$ while being independent of the number of actions. In particular, when the action space contains the parameter space, the bound on the expected regret is of order $\tilde{O}(d \sqrt{T})$.
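To make the setting concrete, here is a minimal, self-contained sketch of Thompson Sampling on a logistic bandit. It is an illustration only, not the paper's construction: the finite action set, the discrete grid of candidate parameters standing in for the prior, and the exact grid-based posterior update are all simplifying assumptions for the demo.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d, T, beta = 2, 500, 2.0  # dimension, horizon, slope parameter

# Hypothetical finite setup: unit-norm actions and a small parameter
# grid that stands in for the prior's support (illustration only).
thetas = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
actions = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
true_theta = thetas[0]

log_post = np.zeros(len(thetas))  # uniform prior over the grid
best_mean = sigmoid(beta * actions @ true_theta).max()
cum_regret = 0.0

for t in range(T):
    # Thompson Sampling: draw theta from the current posterior ...
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    theta = thetas[rng.choice(len(thetas), p=post)]
    # ... and play the action that is greedy for the sampled parameter.
    a = actions[np.argmax(actions @ theta)]
    p = sigmoid(beta * a @ true_theta)
    r = rng.random() < p  # binary reward with logistic success probability
    cum_regret += best_mean - p
    # Exact Bayesian update on the grid under the logistic likelihood.
    probs = sigmoid(beta * thetas @ a)
    log_post += np.log(probs if r else 1.0 - probs)
```

Because the posterior concentrates on the true parameter, the per-round regret vanishes and the cumulative regret grows sublinearly, which is the behavior the paper's $\tilde{O}(d\sqrt{T})$-type bounds quantify.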