Rethinking Langevin Thompson Sampling from A Stochastic Approximation Perspective

πŸ“… 2025-10-06
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing approximate Thompson Sampling (TS) algorithms rely on Stochastic Gradient Langevin Dynamics (SGLD) sampling updated each round, which requires dynamic hyperparameter tuning and complicates theoretical analysis. Method: We propose TS-SA, a framework unifying Stochastic Approximation (SA) with Langevin Monte Carlo. It introduces a stabilized posterior objective, enabling construction of a fixed-form approximate posterior using only the most recent reward per round, which in turn supports a single constant step size and a tractable convergence analysis. TS-SA employs single-step Langevin updates, a warm-up mechanism, and time-averaged estimation to improve sampling efficiency and stability. Contribution/Results: We establish a near-optimal $O(\sqrt{T})$ regret bound. Empirical results show that even with single-step updates and warm-up, TS-SA substantially outperforms existing approximate TS methods in both theoretical tractability and practical performance.
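For context, the "single-step Langevin update" referenced in the summary follows the standard Langevin Monte Carlo form (stated generically here, not reproduced from the paper), where $\pi$ is the target posterior, $\eta$ is the constant step size, and $\xi_t$ is injected Gaussian noise:

$$\theta_{t+1} = \theta_t + \eta \,\nabla \log \pi(\theta_t) + \sqrt{2\eta}\,\xi_t, \qquad \xi_t \sim \mathcal{N}(0, I).$$

The paper's point is that, with a stationary target $\pi$, a single such step per round with fixed $\eta$ suffices when combined with SA-style time averaging.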

πŸ“ Abstract
Most existing approximate Thompson Sampling (TS) algorithms for multi-armed bandits use Stochastic Gradient Langevin Dynamics (SGLD) or its variants in each round to sample from the posterior, relaxing the need for conjugacy assumptions between priors and reward distributions in vanilla TS. However, they often require approximating a different posterior distribution in each round of the bandit problem. This demands tricky, round-specific tuning of hyperparameters such as dynamic learning rates, causing challenges in both theoretical analysis and practical implementation. To alleviate this non-stationarity, we introduce TS-SA, which incorporates stochastic approximation (SA) within the TS framework. In each round, TS-SA constructs a posterior approximation using only the most recent reward(s), performs a Langevin Monte Carlo (LMC) update, and applies an SA step to average noisy proposals over time. This can be interpreted as approximating a stationary posterior target throughout the entire algorithm, which further yields a fixed step size, a unified convergence analysis framework, and improved posterior estimates through temporal averaging. We establish near-optimal regret bounds for TS-SA, with a simplified and more intuitive theoretical analysis enabled by interpreting the entire algorithm as a simulation of a stationary SGLD process. Our empirical results demonstrate that even a single-step Langevin update with a short warm-up substantially outperforms existing methods on bandit tasks.
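The per-round loop described in the abstract (one LMC step from the most recent reward, then an SA averaging step) can be sketched on a toy Gaussian bandit. This is an illustrative reconstruction under simplifying assumptions, not the paper's actual algorithm: the stabilized posterior objective is replaced by a simple Gaussian likelihood-plus-prior gradient, and exploration comes only from the Langevin noise on the pulled arm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy Gaussian bandit: arm k has unknown mean; we keep one Langevin
# iterate (theta) and one time-averaged SA estimate (theta_bar) per arm.
K, T = 5, 2000
true_means = rng.normal(0.0, 1.0, K)

eta = 0.05               # fixed Langevin step size, constant across all rounds
warmup = 20              # warm-up Langevin steps before the bandit loop
theta = np.zeros(K)      # current Langevin iterates (noisy posterior samples)
theta_bar = np.zeros(K)  # SA (time-averaged) posterior estimates
counts = np.zeros(K)

def grad_log_post(th, reward):
    # Gradient of an illustrative Gaussian log-posterior built from only
    # the most recent reward plus a N(0, 1) prior (hypothetical choice,
    # standing in for the paper's stabilized objective).
    return -(th - reward) - th  # likelihood term + prior term

# Warm-up: pull each arm once, then run a few Langevin steps on it.
for k in range(K):
    r = true_means[k] + rng.normal()
    for _ in range(warmup):
        theta[k] += eta * grad_log_post(theta[k], r) + np.sqrt(2 * eta) * rng.normal()
    theta_bar[k] = theta[k]
    counts[k] = 1

total_reward = 0.0
for t in range(T):
    arm = int(np.argmax(theta))          # Thompson-style: act greedily on noisy samples
    r = true_means[arm] + rng.normal()   # observe one reward
    total_reward += r
    # Single-step Langevin update using only the most recent reward.
    theta[arm] += eta * grad_log_post(theta[arm], r) + np.sqrt(2 * eta) * rng.normal()
    # SA step: running (Polyak-style) average of the noisy proposals.
    counts[arm] += 1
    theta_bar[arm] += (theta[arm] - theta_bar[arm]) / counts[arm]
```

Because the target stays stationary, the same `eta` is reused every round; the averaged `theta_bar` is what benefits from the temporal-averaging argument, while the noisy `theta` drives exploration.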
Problem

Research questions and friction points this paper is trying to address.

Addresses non-stationary posterior approximation in Thompson Sampling
Eliminates tricky round-specific hyperparameter tuning in bandit algorithms
Improves posterior estimates through temporal averaging techniques
Innovation

Methods, ideas, or system contributions that make the work stand out.

Stochastic approximation for stationary posterior target
Fixed step-size with unified convergence analysis
Single-step Langevin update with warm-up outperforms existing methods
πŸ”Ž Similar Papers
No similar papers found.