🤖 AI Summary
This paper studies the Bayesian regret of Thompson sampling for logistic bandits, where binary rewards follow a logistic link function and the optimal action depends on a high-dimensional parameter. Methodologically, it introduces a novel upper bound on the information ratio that decouples the dependence on the slope parameter $\beta$ and reveals the critical role of the alignment $\alpha$ between the action and parameter spaces. The analysis leverages an information-theoretic framework to derive a Bayesian expected regret bound of $O\big((d/\alpha)\sqrt{T \log(\beta T / d)}\big)$. Crucially, this bound depends only logarithmically on $\beta$, improving upon prior linear or exponential dependencies, and avoids explicit dependence on the number of actions. When the action space contains the parameter space, the bound simplifies to $\tilde{O}(d\sqrt{T})$, substantially improving over existing results. To the paper's knowledge, this is the first regret guarantee for Thompson sampling in logistic bandits that is only logarithmic in $\beta$ and independent of the number of actions.
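The route from an information-ratio bound to a regret bound follows the standard argument of Russo and Van Roy (2016); the sketch below assumes (as the stated rates suggest, though the paper's exact accounting may differ) that the entropy-like information term scales as $d \log(\beta T / d)$:

$$
\mathbb{E}[\mathrm{Regret}(T)] \;\le\; \sqrt{\bar{\Gamma} \cdot H \cdot T},
\qquad
\bar{\Gamma} \;\le\; \tfrac{9}{2}\, d\, \alpha^{-2},
\quad
H \;\approx\; d \log\!\big(\beta T / d\big),
$$

which combine to give

$$
\mathbb{E}[\mathrm{Regret}(T)]
\;=\; O\!\Big(\tfrac{d}{\alpha}\,\sqrt{T \log(\beta T / d)}\Big).
$$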
📝 Abstract
We study the performance of the Thompson Sampling algorithm for logistic bandit problems. In this setting, an agent receives binary rewards with probabilities determined by a logistic function, $\exp(\beta \langle a, \theta \rangle)/(1+\exp(\beta \langle a, \theta \rangle))$, with slope parameter $\beta > 0$, and where both the action $a \in \mathcal{A}$ and the parameter $\theta \in \mathcal{O}$ lie within the $d$-dimensional unit ball. Adopting the information-theoretic framework introduced by Russo and Van Roy (2016), we analyze the information ratio, a statistic that quantifies the trade-off between the immediate regret incurred and the information gained about the optimal action. We improve upon previous results by establishing that the information ratio is bounded by $\tfrac{9}{2} d \alpha^{-2}$, where $\alpha$ is a minimax measure of the alignment between the action space $\mathcal{A}$ and the parameter space $\mathcal{O}$, and is independent of $\beta$. Using this result, we derive a bound of order $O\big((d/\alpha)\sqrt{T \log(\beta T/d)}\big)$ on the Bayesian expected regret of Thompson Sampling incurred after $T$ time steps. To our knowledge, this is the first regret bound for logistic bandits that depends only logarithmically on $\beta$ while being independent of the number of actions. In particular, when the action space contains the parameter space, the bound on the expected regret is of order $\tilde{O}(d \sqrt{T})$.
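To make the setting concrete, here is a minimal, self-contained sketch of Thompson Sampling on a logistic bandit. It is an illustration only, not the paper's construction: the finite action set, the discrete grid of candidate parameters standing in for the prior, and the exact grid-based posterior update are all simplifying assumptions for the demo.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d, T, beta = 2, 500, 2.0  # dimension, horizon, slope parameter

# Hypothetical finite setup: unit-norm actions and a small parameter
# grid that stands in for the prior's support (illustration only).
thetas = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
actions = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
true_theta = thetas[0]

log_post = np.zeros(len(thetas))  # uniform prior over the grid
best_mean = sigmoid(beta * actions @ true_theta).max()
cum_regret = 0.0

for t in range(T):
    # Thompson Sampling: draw theta from the current posterior ...
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    theta = thetas[rng.choice(len(thetas), p=post)]
    # ... and play the action that is greedy for the sampled parameter.
    a = actions[np.argmax(actions @ theta)]
    p = sigmoid(beta * a @ true_theta)
    r = rng.random() < p  # binary reward with logistic success probability
    cum_regret += best_mean - p
    # Exact Bayesian update on the grid under the logistic likelihood.
    probs = sigmoid(beta * thetas @ a)
    log_post += np.log(probs if r else 1.0 - probs)
```

Because the posterior concentrates on the true parameter, the per-round regret vanishes and the cumulative regret grows sublinearly, which is the behavior the paper's $\tilde{O}(d\sqrt{T})$-type bounds quantify.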