Quick-Draw Bandits: Quickly Optimizing in Nonstationary Environments with Extremely Many Arms

📅 2025-05-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses online sequential decision-making in non-stationary environments with ultra-large (continuous or infinite) action spaces. We propose the first Bayesian optimization framework that simultaneously ensures dynamic adaptability and computational efficiency. Methodologically, we introduce a novel integration of Gaussian interpolation with a sliding time window to rapidly model non-stationary continuous reward functions; additionally, we impose Lipschitz continuity constraints to guarantee theoretical tractability. We prove that the algorithm achieves the optimal cumulative regret bound of $O^*(\sqrt{T})$, strictly improving upon existing sliding-window Gaussian process approaches. Empirically, our method accelerates computation by two to four orders of magnitude (100×–10,000×) over state-of-the-art baselines while attaining significantly lower regret and superior real-time performance in dynamic settings.

📝 Abstract
Canonical algorithms for multi-armed bandits typically assume a stationary reward environment where the size of the action space (number of arms) is small. More recently developed methods typically relax only one of these assumptions: existing non-stationary bandit policies are designed for a small number of arms, while Lipschitz, linear, and Gaussian process bandit policies are designed to handle a large (or infinite) number of arms in stationary reward environments under constraints on the reward function. In this manuscript, we propose a novel policy to learn reward environments over a continuous space using Gaussian interpolation. We show that our method efficiently learns continuous Lipschitz reward functions with $\mathcal{O}^*(\sqrt{T})$ cumulative regret. Furthermore, our method naturally extends to non-stationary problems with a simple modification. We finally demonstrate that our method is computationally favorable (100–10,000× faster) and experimentally outperforms sliding Gaussian process policies on datasets with non-stationarity and an extremely large number of arms.
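The modeling idea in the abstract — fitting a continuous reward function from a handful of arm/reward observations via Gaussian interpolation — can be sketched as an RBF-kernel interpolant. This is an illustrative sketch, not the paper's exact estimator; the function name, lengthscale, and jitter term are assumptions.

```python
import numpy as np

def rbf_interpolate(x_obs, y_obs, x_query, lengthscale=0.1):
    """Gaussian (RBF) interpolation over a 1-D arm space.

    Fits weights so the interpolant passes through the observed
    rewards exactly, then evaluates it at the query points.
    Illustrative sketch only; lengthscale and jitter are assumed.
    """
    # Kernel matrix between observed arms
    K = np.exp(-(x_obs[:, None] - x_obs[None, :]) ** 2 / (2 * lengthscale**2))
    # Solve K w = y (tiny jitter on the diagonal for numerical stability)
    w = np.linalg.solve(K + 1e-8 * np.eye(len(x_obs)), y_obs)
    # Cross-kernel between query points and observed arms
    Kq = np.exp(-(x_query[:, None] - x_obs[None, :]) ** 2 / (2 * lengthscale**2))
    return Kq @ w

x_obs = np.array([0.0, 0.3, 0.7, 1.0])
y_obs = np.sin(2 * np.pi * x_obs)
grid = np.linspace(0.0, 1.0, 101)
est = rbf_interpolate(x_obs, y_obs, grid)
```

Unlike a full Gaussian process posterior, this interpolant requires only one linear solve per update, which is consistent with the computational savings the abstract reports.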
Problem

Research questions and friction points this paper is trying to address.

Optimizing nonstationary bandits with many arms
Learning continuous Lipschitz reward functions efficiently
Extending method to nonstationary problems effectively
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Gaussian interpolation for continuous rewards
Achieves O*(sqrt(T)) cumulative regret
Extends to non-stationary problems efficiently
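The non-stationary extension highlighted above amounts to fitting the interpolant only on a sliding window of recent observations, so stale rewards are forgotten as the environment drifts. A minimal sketch of that loop, assuming a toy drifting reward, a fixed window length `W`, and greedy arm selection on a discretized grid (all illustrative choices, not the paper's algorithm):

```python
import numpy as np
from collections import deque

def gaussian_interp(x_obs, y_obs, x_query, ls=0.1):
    # RBF-kernel interpolation with a small ridge term for stability.
    K = np.exp(-(x_obs[:, None] - x_obs[None, :]) ** 2 / (2 * ls**2))
    w = np.linalg.solve(K + 1e-6 * np.eye(len(x_obs)), y_obs)
    Kq = np.exp(-(x_query[:, None] - x_obs[None, :]) ** 2 / (2 * ls**2))
    return Kq @ w

W = 50                               # sliding-window length (assumed)
history = deque(maxlen=W)            # only the W most recent pairs survive
grid = np.linspace(0.0, 1.0, 201)    # discretized continuous arm space
rng = np.random.default_rng(0)

def true_reward(x, t):
    # Toy non-stationary reward: the peak location drifts as t grows.
    return np.exp(-((x - (0.2 + 0.6 * t / 200)) ** 2) / 0.02)

for t in range(200):
    if len(history) < 5:
        arm = rng.uniform(0.0, 1.0)  # cold-start exploration
    else:
        xs = np.array([h[0] for h in history])
        ys = np.array([h[1] for h in history])
        est = gaussian_interp(xs, ys, grid)
        # Greedy on the interpolant, plus a little exploration noise.
        arm = float(np.clip(grid[int(np.argmax(est))] + 0.02 * rng.normal(),
                            0.0, 1.0))
    reward = true_reward(arm, t) + 0.01 * rng.normal()
    history.append((arm, reward))
```

The `deque(maxlen=W)` does the forgetting for free: appending the newest observation silently evicts the oldest one, so each refit sees only the last `W` rewards.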