Fooling Algorithms in Non-Stationary Bandits using Belief Inertia

📅 2025-11-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work investigates worst-case regret lower bounds in piecewise-stationary multi-armed bandits. Existing lower bounds rely on "infrequent sampling" arguments, in which long intervals without exploration let an adversary change the rewards unnoticed. The paper instead introduces *belief inertia*: the delay in an algorithm's response that is induced by its historical mean estimates. By constructing adversarial changepoint instances and modeling how the empirical beliefs evolve, the authors prove that classical algorithms, including ETC, ε-greedy, and UCB, incur Ω(T) regret even with a single change point and regardless of how their parameters are tuned. Adding periodic restarts does not remove the barrier: the worst-case regret remains linear in T. The analysis identifies delayed belief updating, rather than insufficient exploration alone, as a cause of failure to adapt after a change, and suggests belief inertia as a tool for deriving sharp lower bounds in non-stationary bandits.

📝 Abstract

We study the problem of worst-case regret in piecewise-stationary multi-armed bandits. While the minimax theory for stationary bandits is well established, understanding analogous limits in time-varying settings is challenging. Existing lower bounds rely on what we refer to as infrequent sampling arguments, where long intervals without exploration allow adversarial reward changes that induce large regret. In this paper, we introduce a fundamentally different approach based on a belief inertia argument. Our analysis captures how an algorithm's empirical beliefs, encoded through historical reward averages, create momentum that resists new evidence after a change. We show how this inertia can be exploited to construct adversarial instances that mislead classical algorithms such as Explore-Then-Commit, epsilon-greedy, and UCB, causing them to suffer regret that grows linearly with T and with a substantial constant factor, regardless of how their parameters are tuned, even with a single change point. We extend the analysis to algorithms that periodically restart to handle non-stationarity and prove that, even then, the worst-case regret remains linear in T. Our results indicate that exploiting belief inertia can be a powerful method for deriving sharp lower bounds in non-stationary bandits.
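The momentum of a historical reward average can be illustrated with a small noiseless calculation. The sketch below is not the paper's construction: the function name `pulls_until_belief_flips`, the specific means, and the noiseless rewards are all assumptions chosen for illustration. It counts how many pulls after a change are needed before an arm's running average falls below a rival arm's mean.

```python
def pulls_until_belief_flips(n_before, mean_before, mean_after, rival_mean):
    """Count post-change pulls until an arm's running average drops below rival_mean.

    Hypothetical helper for illustration only. Rewards are noiseless: each
    pull returns the arm's current mean, so the only effect shown is the
    weight of the n_before samples collected before the change.
    """
    if not mean_after < rival_mean < mean_before:
        raise ValueError("need mean_after < rival_mean < mean_before")
    total = n_before * mean_before  # sum of rewards observed before the change
    count = n_before                # number of samples behind the average
    extra = 0                       # pulls made after the change
    while total / count >= rival_mean:
        total += mean_after         # one more pull under the new, lower mean
        count += 1
        extra += 1
    return extra


if __name__ == "__main__":
    # Arm mean drops from 0.75 to 0.25; the rival arm stays at 0.5.
    for n in (10, 100, 1000):
        print(n, pulls_until_belief_flips(n, 0.75, 0.25, 0.5))
```

In this toy setting the number of stale pulls grows in proportion to the amount of history, so a learner that trusts the running average keeps choosing the now-worse arm for roughly as long as it had observed it before the change. Confidence bonuses, noise, and forced exploration, which the real algorithms use, are deliberately left out.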

Problem

Research questions and friction points this paper is trying to address.

Studying worst-case regret in piecewise-stationary multi-armed bandits

Analyzing how belief inertia causes algorithms to resist environmental changes

Proving that linear regret persists despite parameter tuning or periodic restarts

Innovation

Methods, ideas, or system contributions that make the work stand out.

Belief inertia argument exploits the momentum of historical reward averages

Adversarial instances mislead classical bandit algorithms

Linear regret proven even with periodic restarts

🔎 Similar Papers

Non-Stationary Latent Auto-Regressive Bandits