A Broader View of Thompson Sampling

📅 2025-10-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
The intrinsic exploration-exploitation trade-off in Thompson Sampling (TS) lacks a clear mechanistic explanation, particularly regarding how posterior sampling achieves long-term balance. Method: We recast TS as an online optimization problem via a "faithful stationarization" of the regret formulation, which converts the finite-horizon dynamic decision problem into a closely matching stationary counterpart—enabling structural analysis through Bellman's principle. Contribution/Results: Our analysis shows that TS mimics the structure of the Bellman-optimal policy for the stationarized objective, with greediness regularized by a residual-uncertainty measure based on point-biserial correlation—yielding an interpretable account of the exploration-exploitation balance. This framework unifies Bayesian optimization, posterior sampling, and dynamic programming perspectives while preserving low regret. It improves algorithmic interpretability and supports principled generalization beyond standard bandit settings to broader sequential decision-making problems.
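
As background for the discussion below, here is a minimal sketch of Thompson Sampling for a Bernoulli bandit (the arm means, horizon, and Beta(1,1) priors are illustrative assumptions, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])  # hypothetical arm reward probabilities
horizon = 1000

# Beta(1, 1) priors on each arm's mean: track success/failure counts.
successes = np.zeros(len(true_means))
failures = np.zeros(len(true_means))

for _ in range(horizon):
    # Posterior sampling: draw one plausible mean per arm, act greedily on the draws.
    samples = rng.beta(successes + 1, failures + 1)
    arm = int(samples.argmax())
    reward = float(rng.random() < true_means[arm])
    successes[arm] += reward
    failures[arm] += 1.0 - reward

print("pulls per arm:", (successes + failures).astype(int))
print("posterior means:", np.round((successes + 1) / (successes + failures + 2), 3))
```

The only "decision rule" is the argmax over posterior draws; the paper's question is why this rule, with no explicit exploration bonus, balances exploration and exploitation so well.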

📝 Abstract
Thompson Sampling is one of the most widely used and studied bandit algorithms, known for its simple structure, low-regret performance, and solid theoretical guarantees. Yet, in stark contrast to most other families of bandit algorithms, the exact mechanism through which posterior sampling (as introduced by Thompson) is able to "properly" balance exploration and exploitation remains a mystery. In this paper we show that the core insight to address this question stems from recasting Thompson Sampling as an online optimization algorithm. To distill this, a key conceptual tool is introduced, which we refer to as "faithful" stationarization of the regret formulation. Essentially, the finite-horizon dynamic optimization problem is converted into a stationary counterpart which "closely resembles" the original objective (in contrast, the classical infinite-horizon discounted formulation, which leads to the Gittins index, alters the problem and objective in too significant a manner). The newly crafted time-invariant objective can be studied using Bellman's principle, which leads to a time-invariant optimal policy. When viewed through this lens, Thompson Sampling admits a simple online optimization form that mimics the structure of the Bellman-optimal policy, in which greediness is regularized by a measure of residual uncertainty based on point-biserial correlation. This answers the question of how Thompson Sampling balances exploration and exploitation and, moreover, provides a principled framework to study and further improve Thompson's original idea.
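
The abstract does not spell out how the residual-uncertainty measure is constructed; as a reference point only, the standard point-biserial correlation between a binary variable Y and a numeric variable X is:

```latex
% Point-biserial correlation of a binary Y against a numeric X:
% \bar{X}_1, \bar{X}_0 are the means of X over the Y=1 and Y=0 groups,
% s_X is the standard deviation of X, and p = P(Y = 1).
r_{pb} \;=\; \frac{\bar{X}_1 - \bar{X}_0}{s_X}\,\sqrt{p\,(1-p)}
```

This is simply the Pearson correlation specialized to a binary/continuous pair; the paper's specific instantiation of it as an exploration regularizer is developed in the full text.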
Problem

Research questions and friction points this paper is trying to address.

Explains Thompson Sampling's exploration-exploitation balance mechanism
Reformulates bandit problem as stationary online optimization objective
Reveals Thompson Sampling mimics Bellman-optimal policy structure
Innovation

Methods, ideas, or system contributions that make the work stand out.

Recasting Thompson Sampling as online optimization algorithm
Introducing faithful stationarization of regret formulation
Regularizing greediness with residual uncertainty measure
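
To see the regularization effect concretely, a minimal sketch follows (the Beta posterior counts are invented for illustration, and this shows standard Thompson Sampling behavior, not the paper's derivation): a greedy rule would always play the arm with the highest posterior mean, whereas posterior sampling also allocates probability to a nearly-as-good but under-explored arm, in proportion to its residual uncertainty.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical Beta posteriors for three arms. Arm 1 has the highest
# posterior mean; arm 2 has a similar mean but far fewer observations,
# so its posterior is much wider (more residual uncertainty).
alphas = np.array([2.0, 32.0, 3.0])
betas = np.array([8.0, 28.0, 3.0])

posterior_means = alphas / (alphas + betas)

# Monte Carlo estimate of P(arm is argmax of one joint posterior draw),
# i.e., the probability Thompson Sampling plays each arm right now.
draws = rng.beta(alphas, betas, size=(100_000, 3))
ts_pick_prob = np.bincount(draws.argmax(axis=1), minlength=3) / len(draws)

print("posterior means:  ", np.round(posterior_means, 3))  # greedy: all mass on arm 1
print("TS pick frequency:", np.round(ts_pick_prob, 3))     # mass leaks to uncertain arm 2
```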
Authors
Yanlin Qu (Columbia Business School)
Hongseok Namkoong (Columbia University)
Assaf Zeevi (Columbia Business School)

Tags: AI, Sequential Decision-making