Nonstationary Bandit Learning via Predictive Sampling

📅 2022-05-04

🏛️ International Conference on Artificial Intelligence and Statistics

📈 Citations: 19

✨ Influential: 3

🤖 AI Summary

This paper shows that Thompson sampling can perform poorly in non-stationary multi-armed bandits because, when exploring, it does not distinguish actions by how quickly the information they yield loses its usefulness. The authors propose predictive sampling, an algorithm that deprioritizes acquiring information that quickly loses usefulness. They establish a Bayesian regret bound on its performance and provide versions whose computations scale tractably to complex bandit environments of practical interest. In numerical simulations, predictive sampling outperforms Thompson sampling in all non-stationary environments examined.

📝 Abstract

Thompson sampling has proven effective across a wide range of stationary bandit environments. However, as we demonstrate in this paper, it can perform poorly when applied to non-stationary environments. We attribute such failures to the fact that, when exploring, the algorithm does not differentiate actions based on how quickly the information acquired loses its usefulness due to non-stationarity. Building upon this insight, we propose predictive sampling, an algorithm that deprioritizes acquiring information that quickly loses usefulness. A theoretical guarantee on the performance of predictive sampling is established through a Bayesian regret bound. We provide versions of predictive sampling for which computations tractably scale to complex bandit environments of practical interest. Through numerical simulations, we demonstrate that predictive sampling outperforms Thompson sampling in all non-stationary environments examined.
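
To make the setting concrete, the following is a minimal sketch of standard Beta-Bernoulli Thompson sampling run on a toy non-stationary bandit. It is not the paper's predictive sampling algorithm, nor its experimental setup. The environment (each arm's mean is redrawn uniformly with a per-step probability `change_prob[i]`), the stationary Beta posterior update, and the names `thompson_step` and `run` are all assumptions made for illustration.

```python
import random


def thompson_step(alpha, beta, rng):
    """Pick the arm whose sampled Beta(alpha, beta) mean is largest."""
    samples = [rng.betavariate(a, b) for a, b in zip(alpha, beta)]
    return max(range(len(samples)), key=samples.__getitem__)


def run(horizon=1000, change_prob=(0.0, 0.5), seed=0):
    """Simulate Thompson sampling on a toy non-stationary Bernoulli bandit.

    change_prob[i] is the assumed per-step probability that arm i's mean
    reward is redrawn uniformly from [0, 1]. The agent keeps a stationary
    Beta posterior per arm, so it ignores how fast each arm changes.
    """
    rng = random.Random(seed)
    n = len(change_prob)
    means = [rng.random() for _ in range(n)]
    alpha, beta = [1.0] * n, [1.0] * n  # uniform Beta(1, 1) priors
    pulls, total = [0] * n, 0
    for _ in range(horizon):
        arm = thompson_step(alpha, beta, rng)
        reward = 1 if rng.random() < means[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
        total += reward
        # Non-stationarity: each arm's mean may be redrawn after every step.
        for i in range(n):
            if rng.random() < change_prob[i]:
                means[i] = rng.random()
    return pulls, total


if __name__ == "__main__":
    pulls, total = run()
    print("pulls per arm:", pulls, "total reward:", total)
```

In this sketch, whatever is learned about the second arm goes stale quickly because its mean is redrawn often, yet the sampler treats both arms identically when exploring. That is the kind of situation the abstract describes.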

Problem

Research questions and friction points this paper is trying to address.

Addresses poor Thompson sampling performance in non-stationary bandit environments

Proposes predictive sampling to prioritize long-term useful information

Provides scalable algorithms with theoretical guarantees for non-stationary bandits

Innovation

Methods, ideas, or system contributions that make the work stand out.

Predictive sampling algorithm for non-stationary bandits

Deprioritizes acquiring information that quickly becomes outdated

Scalable computation for complex bandit environments

🔎 Similar Papers

Non-Stationary Latent Auto-Regressive Bandits