Online Learning of Whittle Indices for Restless Bandits with Non-Stationary Transition Kernels

📅 2025-06-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper addresses resource allocation in restless multi-armed bandits (RMABs) when the transition kernels are unknown and non-stationary, so the Whittle index cannot be computed directly. It proposes an online learning algorithm for Whittle indices that predicts the current transition kernels by solving a linear optimization problem built from upper confidence bounds (UCBs) and sliding-window empirical transition probabilities, then computes the Whittle index of the predicted kernels. Prior domain knowledge and structural information about the RMAB can be used to accelerate learning. The sliding windows and confidence bounds are designed to guarantee dynamic regret that is sublinear in the number of episodes $T$ when the kernels change slowly (rate at most $\epsilon = 1/T^k$ with $k > 0$). In numerical experiments, the algorithm achieves the lowest cumulative regret among the compared baselines in non-stationary environments. Potential applications include time-varying systems such as dynamic network scheduling and adaptive healthcare interventions.

📝 Abstract

We consider optimal resource allocation for restless multi-armed bandits (RMABs) in unknown, non-stationary settings. RMABs are PSPACE-hard to solve optimally, even when all parameters are known. The Whittle index policy is known to achieve asymptotic optimality for a large class of such problems, while remaining computationally efficient. In many practical settings, however, the transition kernels required to compute the Whittle index are unknown and non-stationary. In this work, we propose an online learning algorithm for Whittle indices in this setting. Our algorithm first predicts the current transition kernels by solving a linear optimization problem based on upper confidence bounds and empirical transition probabilities calculated from data over a sliding window. Then, it computes the Whittle index associated with the predicted transition kernels. We design these sliding windows and upper confidence bounds to guarantee sub-linear dynamic regret in the number of episodes $T$, under the condition that the transition kernels change slowly over time (rate upper bounded by $\epsilon = 1/T^k$ with $k > 0$). Furthermore, our proposed algorithm and regret analysis are designed to exploit prior domain knowledge and structural information of the RMABs to accelerate the learning process. Numerical results validate that our algorithm achieves the lowest cumulative regret among the compared baselines in non-stationary environments.
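The sliding-window estimation step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the class name `SlidingWindowKernelEstimator`, the uniform fallback for unvisited state-action pairs, and the Hoeffding-style confidence radius are all assumptions made for the example, and the linear optimization that combines the two quantities is not reproduced here.

```python
from collections import deque
import math


class SlidingWindowKernelEstimator:
    """Empirical transition estimates for one arm from the last `window` episodes.

    Hypothetical helper for illustration; names and constants are assumptions.
    """

    def __init__(self, n_states, n_actions, window):
        self.n_states = n_states
        self.n_actions = n_actions
        # Each entry holds the (state, action, next_state) triples of one episode;
        # deque(maxlen=...) silently drops the oldest episode once the window is full.
        self.episodes = deque(maxlen=window)

    def add_episode(self, transitions):
        self.episodes.append(list(transitions))

    def estimate(self, delta=0.05):
        """Return (p_hat, radius): p_hat[s][a] is a distribution over next states,
        radius[s][a] is a confidence width that shrinks with the visit count."""
        S, A = self.n_states, self.n_actions
        counts = [[[0] * S for _ in range(A)] for _ in range(S)]
        for episode in self.episodes:
            for s, a, s_next in episode:
                counts[s][a][s_next] += 1
        p_hat = [[None] * A for _ in range(S)]
        radius = [[1.0] * A for _ in range(S)]  # no data: widest possible radius
        for s in range(S):
            for a in range(A):
                n = sum(counts[s][a])
                if n == 0:
                    p_hat[s][a] = [1.0 / S] * S  # assumed uniform fallback
                else:
                    p_hat[s][a] = [c / n for c in counts[s][a]]
                    # Hoeffding-style radius (an assumption, not the paper's bound).
                    radius[s][a] = min(1.0, math.sqrt(math.log(2.0 / delta) / (2.0 * n)))
        return p_hat, radius
```

A shorter window tracks a drifting kernel more closely but leaves fewer samples per state-action pair, which widens the radius. This is the trade-off the window length has to balance.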

Problem

Research questions and friction points this paper is trying to address.

Online learning of Whittle indices for non-stationary RMABs

Optimal resource allocation in unknown, changing environments

Sub-linear dynamic regret guarantee for slow kernel changes
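To make the slow-change condition concrete, suppose (as an assumption about notation, since this listing does not give the paper's formal definition) that the kernel $P_t$ in effect during episode $t$ satisfies $\lVert P_{t+1} - P_t \rVert \le \epsilon$ in some norm. The total drift over the horizon is then bounded by

$$
\lVert P_T - P_1 \rVert \;\le\; \sum_{t=1}^{T-1} \lVert P_{t+1} - P_t \rVert \;\le\; (T-1)\,\epsilon \;<\; T \cdot T^{-k} \;=\; T^{1-k}.
$$

For example, with $k = 1/2$ and $T = 10^4$, the per-episode change is at most $\epsilon = 0.01$ and the total drift is below $T^{1/2} = 100$. Under the same assumption, the drift inside a window of $W$ episodes is at most $(W-1)\,\epsilon$, which is the bias that a sliding-window estimate has to trade against sampling noise.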

Innovation

Methods, ideas, or system contributions that make the work stand out.

Online learning algorithm for Whittle indices

Sliding window with upper confidence bounds

Exploits prior domain knowledge for acceleration
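Once a transition kernel has been predicted, the Whittle index of each state can be computed from it. The sketch below uses a standard textbook approach, binary search over the passive subsidy with discounted value iteration. It is not taken from the paper, and it assumes an indexable two-action arm. The function names, the discount factor `beta`, and the search bracket `[-10, 10]` are illustrative assumptions.

```python
def q_values(P, r, subsidy, beta=0.95, iters=500):
    """Value iteration for one arm with a subsidy paid on the passive action.

    P[a][s][t] is the probability of moving from s to t under action a
    (0 = passive, 1 = active); r[s][a] is the reward. Returns a list of
    (Q_passive, Q_active) pairs, one per state.
    """
    n = len(r)
    V = [0.0] * n
    Q = [(0.0, 0.0)] * n
    for _ in range(iters):
        Q = [
            (
                r[s][0] + subsidy + beta * sum(P[0][s][t] * V[t] for t in range(n)),
                r[s][1] + beta * sum(P[1][s][t] * V[t] for t in range(n)),
            )
            for s in range(n)
        ]
        V = [max(q) for q in Q]
    return Q


def whittle_index(P, r, state, lo=-10.0, hi=10.0, tol=1e-4, beta=0.95):
    """Subsidy at which passive and active are equally good in `state`.

    Assumes the arm is indexable and the index lies in [lo, hi].
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        q_passive, q_active = q_values(P, r, mid, beta)[state]
        if q_active > q_passive:
            lo = mid  # active still preferred, so the index is larger
        else:
            hi = mid
    return (lo + hi) / 2.0


def select_arms(indices, budget):
    """Whittle index policy: activate the `budget` arms with the largest index."""
    order = sorted(range(len(indices)), key=lambda i: indices[i], reverse=True)
    return sorted(order[:budget])
```

A quick sanity check: when the transitions do not depend on the action, the index of a state reduces to the reward gap `r[s][1] - r[s][0]`, so an arm with `r = [[0.0, 0.3], [0.0, 1.0]]` and identical passive and active kernels should return roughly 0.3 and 1.0 for its two states.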

🔎 Similar Papers

Context in Public Health for Underserved Communities: A Bayesian Approach to Online Restless Bandits