Lagrangian Index Policy for Restless Bandits with Average Reward

📅 2024-12-17
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
This paper studies the Restless Multi-Armed Bandit (RMAB) problem under the long-run average-reward criterion. To address the poor performance of Whittle's Index Policy (WIP) in degenerate scenarios and the high memory cost of learning it, we propose the Lagrangian Index Policy (LIP), an index heuristic derived from the Lagrangian relaxation of the problem. We give a new proof of asymptotic optimality for homogeneous arms, as the number of arms goes to infinity, based on exchangeability and de Finetti's theorem. LIP admits model-free online learning via tabular Q-learning or neural-network-based reinforcement learning, with significantly lower memory requirements than the analogous schemes for WIP. For restart-type models, which cover optimal web crawling and weighted Age-of-Information minimization, we derive the Lagrangian index analytically in closed form. Experiments show that LIP remains near-optimal and robust even in the cases where WIP performs poorly.
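As a rough illustration of how an index policy of this kind acts at run time (not the paper's specific algorithm), the sketch below activates, at each decision epoch, the M arms with the largest current index values. The function `lagrangian_index` and the state encoding are hypothetical placeholders.

```python
import numpy as np

def select_arms(indices: np.ndarray, m: int) -> np.ndarray:
    """Return the positions of the m arms with the largest index values.

    Ties are broken arbitrarily by argpartition; a real implementation
    might randomize among tied arms.
    """
    return np.argpartition(indices, -m)[-m:]

def act(states, lagrangian_index, m):
    """One decision epoch of a generic index policy (illustrative only).

    `lagrangian_index` is a placeholder mapping an arm's current state to
    its (precomputed or learned) index value.
    """
    idx = np.array([lagrangian_index(s) for s in states])
    active = np.zeros(len(states), dtype=bool)
    active[select_arms(idx, m)] = True
    return active  # True = activate the arm, False = rest it
```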

📝 Abstract
We study the Lagrangian Index Policy (LIP) for restless multi-armed bandits with long-run average reward. In particular, we compare the performance of LIP with the performance of the Whittle Index Policy (WIP), both heuristic policies known to be asymptotically optimal under certain natural conditions. Even though in most cases their performances are very similar, in the cases when WIP shows bad performance, LIP continues to perform very well. We then propose reinforcement learning algorithms, both tabular and NN-based, to obtain online learning schemes for LIP in the model-free setting. The proposed reinforcement learning schemes for LIP require significantly less memory than the analogous scheme for WIP. We calculate analytically the Lagrangian index for the restart model, which describes optimal web crawling and the minimization of the weighted age of information. We also give a new proof of asymptotic optimality in the case of homogeneous bandits as the number of arms goes to infinity, based on exchangeability and de Finetti's theorem.
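The abstract mentions tabular, model-free learning of LIP. The paper's exact update rules are not reproduced here; the following is a minimal sketch assuming the index of a state is estimated as the advantage of the active over the passive action in a single arm's average-reward problem with a fixed Lagrange-multiplier subsidy `lam`, learned with an RVI-style Q-learning update. All names, the `env` interface, and the hyperparameters are illustrative.

```python
import numpy as np

def rvi_q_learning(env, n_states, lam, steps=50_000,
                   alpha=0.05, eps=0.1, ref_state=0, seed=0):
    """RVI-style Q-learning for one restless arm (illustrative sketch).

    The passive action (a=0) earns a subsidy `lam` on top of the arm's
    reward; the Lagrangian index of a state is then estimated as
    Q[s, 1] - Q[s, 0], the advantage of activating over resting.
    `env` is an assumed single-arm simulator with reset() -> state and
    step(a) -> (next_state, reward); this is not the paper's exact scheme.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, 2))
    s = env.reset()
    for _ in range(steps):
        a = int(rng.integers(2)) if rng.random() < eps else int(np.argmax(Q[s]))
        s_next, r = env.step(a)
        r_adj = r + (lam if a == 0 else 0.0)
        # Subtract the value of a fixed reference state so that the
        # Q-values stay bounded under the average-reward criterion.
        target = r_adj + np.max(Q[s_next]) - np.max(Q[ref_state])
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next
    index = Q[:, 1] - Q[:, 0]  # estimated Lagrangian index per state
    return Q, index
```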
Problem

Research questions and friction points this paper is trying to address.

How does LIP compare with WIP for restless bandits under the long-run average-reward criterion?
How can LIP be learned online, with modest memory, in the model-free setting?
Can the Lagrangian index be computed analytically for the restart model behind web crawling and age-of-information minimization?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lagrangian Index Policy (LIP) for restless bandits with average reward
Tabular and NN-based reinforcement learning schemes for model-free LIP
Closed-form Lagrangian index for the restart model (see the sketch below)
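The closed-form expression for the restart model is given in the paper and not reproduced here. As a way to experiment numerically with the same modelling idea, the sketch below solves a single restart-type arm (state = time since the last crawl/restart) by relative value iteration with a passive-action subsidy `lam` and reads off the active-vs-passive advantage as a stand-in for the index. The reward function `freshness` and the `crawl_cost` are made-up examples.

```python
import numpy as np

def restart_arm_index(n_states=20, lam=0.3, iters=5_000, tol=1e-10):
    """Relative value iteration for a single restart-type arm (sketch).

    State s = time since the last restart, capped at n_states - 1.
    Passive (a=0): the state ages by one; reward freshness(s) plus subsidy lam.
    Active  (a=1): the state resets to 0; reward freshness(0) minus a crawl cost.
    Returns the active-vs-passive advantage per state, used here as a
    numerical stand-in for the index; the paper derives it in closed form.
    """
    crawl_cost = 0.2

    def freshness(s):
        return 1.0 / (1.0 + s)  # example reward: decays with time since crawl

    def q_values(h):
        q0 = np.array([freshness(s) + lam + h[min(s + 1, n_states - 1)]
                       for s in range(n_states)])                  # passive
        q1 = np.full(n_states, freshness(0) - crawl_cost + h[0])   # active
        return q0, q1

    h = np.zeros(n_states)  # relative value function
    for _ in range(iters):
        q0, q1 = q_values(h)
        h_new = np.maximum(q0, q1)
        h_new -= h_new[0]   # normalize to keep values bounded (relative VI)
        if np.max(np.abs(h_new - h)) < tol:
            h = h_new
            break
        h = h_new

    q0, q1 = q_values(h)
    return q1 - q0          # advantage of activating in each state

print(np.round(restart_arm_index(), 3))
```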
Konstantin Avrachenkov
Director of Research, INRIA Sophia Antipolis
Applied Probability · Markov Chains · Singular Perturbations · Networks · Machine Learning
Vivek S. Borkar
Department of Electrical Engineering, Indian Institute of Technology Bombay, Powai, Mumbai 400076, India
Pratik Shah
Department of Mechanical Engineering, Indian Institute of Technology Bombay, Powai, Mumbai 400076, India (now a graduate student at Georgia Institute of Technology, USA).