🤖 AI Summary
This paper studies the non-stationary online restless multi-armed bandit (RMAB) problem, where arm dynamics and rewards evolve over time under a total variation budget $B$. Addressing the failure of classical RMAB algorithms in dynamic domains such as healthcare and recommendation, we establish the first theoretical framework for non-stationary RMAB. Our method integrates sliding-window estimation with upper confidence bound (UCB) principles and introduces a relaxed regret metric tailored to non-stationary environments. We derive a $\widetilde{\mathcal{O}}(N^2 B^{1/4} T^{3/4})$ dynamic regret bound, substantially improving upon static-baseline guarantees. Experiments confirm robustness and practical efficacy under state drift. Key contributions include: (i) the first comprehensive theoretical foundation for non-stationary RMAB; (ii) variation-budget-driven algorithm design; and (iii) provably sublinear dynamic regret.
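As a quick sanity check on the stated rate (an illustrative remark, not a claim from the paper): holding $N$ fixed, the dynamic regret bound is sublinear in the horizon $T$ exactly when the variation budget grows sublinearly,

```latex
\widetilde{\mathcal{O}}\!\left(N^2 B^{1/4} T^{3/4}\right) = o(T)
\quad\Longleftrightarrow\quad
B^{1/4} = o\!\left(T^{1/4}\right)
\quad\Longleftrightarrow\quad
B = o(T),
```

so in particular a constant budget $B = \mathcal{O}(1)$ recovers a $\widetilde{\mathcal{O}}(N^2 T^{3/4})$ rate.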
📝 Abstract
Online restless multi-armed bandits (RMABs) typically assume that each arm follows a stationary Markov Decision Process (MDP) with fixed state transitions and rewards. However, in real-world applications such as healthcare and recommendation systems, these assumptions often break down due to non-stationary dynamics, posing significant challenges for traditional RMAB algorithms. In this work, we specifically consider an $N$-armed RMAB with non-stationary transitions constrained by a bounded variation budget $B$. Our proposed algorithm, mab, integrates sliding-window reinforcement learning (RL) with an upper confidence bound (UCB) mechanism to simultaneously learn the transition dynamics and their variations. We further establish that mab achieves an $\widetilde{\mathcal{O}}(N^2 B^{\frac{1}{4}} T^{\frac{3}{4}})$ regret bound by leveraging a relaxed definition of regret, providing, for the first time, a foundational theoretical framework for non-stationary RMAB problems.
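The sliding-window idea in the abstract can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's algorithm: it estimates an arm's transition matrix from only the most recent `(state, next_state)` samples (so old, drifted data is discarded) and returns a Hoeffding-style per-state confidence radius, which an optimistic (UCB) planner would add to the estimate. The window size, confidence level `delta`, and radius form are all illustrative assumptions.

```python
import numpy as np

def sliding_window_estimate(window, n_states, delta=0.05):
    """Estimate a transition matrix from a sliding window of
    (state, next_state) samples, with a per-state confidence radius.

    Illustrative sketch only: an optimistic planner would act on the
    confidence set {P : |P[s,:] - p_hat[s,:]| <= radius[s]} per state.
    """
    counts = np.zeros((n_states, n_states))
    for s, s_next in window:
        counts[s, s_next] += 1
    # Visits to each state within the window (column vector for broadcasting).
    visits = counts.sum(axis=1, keepdims=True)
    # Empirical transition probabilities; rows with no visits stay zero.
    p_hat = np.divide(counts, np.maximum(visits, 1))
    # Hoeffding-style radius: wider for states rarely visited in the window.
    radius = np.sqrt(np.log(2 * n_states / delta) / (2 * np.maximum(visits, 1)))
    return p_hat, radius
```

Keeping only a recent window trades estimation variance for bias: a short window tracks drifting dynamics but widens the confidence radius, which is the tension the variation budget $B$ lets the analysis balance.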