Planning and Learning in Risk-Aware Restless Multi-Arm Bandit Problem

📅 2024-10-30
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper investigates the risk-aware restless multi-armed bandit (RMAB) problem under resource constraints, aiming at robust dynamic decision-making. Addressing the limitations of conventional risk-neutral formulations, we establish the first sufficient condition for Whittle indexability under risk-sensitive objectives. Building upon this, we propose the first risk-aware Thompson sampling algorithm with provable regret guarantees. Under unknown transition probabilities, the algorithm achieves a sublinear cumulative regret bound of $O(\sqrt{T}K^2)$, where $T$ denotes the time horizon and $K$ the number of arms. Empirical evaluations on equipment replacement and patient scheduling tasks demonstrate that our approach significantly reduces risk exposure while delivering theoretically grounded performance guarantees, establishing a novel paradigm for high-reliability sequential decision-making.

📝 Abstract
In restless multi-arm bandits, a central agent is tasked with optimally distributing limited resources across several bandits (arms), with each arm being a Markov decision process. In this work, we generalize the traditional restless multi-arm bandit problem with a risk-neutral objective by incorporating risk-awareness. We establish indexability conditions for the case of a risk-aware objective and provide a solution based on the Whittle index. In addition, we address the learning problem when the true transition probabilities are unknown by proposing a Thompson sampling approach and show that it achieves bounded regret that scales sublinearly with the number of episodes and quadratically with the number of arms. The efficacy of our method in reducing risk exposure in restless multi-arm bandits is illustrated through a set of numerical experiments in the contexts of machine replacement and patient scheduling applications under both planning and learning setups.
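The Whittle-index policy described in the abstract reduces to a simple decision rule at each step: score every arm by its index in its current state and activate the arms with the highest scores, subject to the resource budget. A minimal sketch of that selection step is below; the function name `whittle_policy` and the toy index table are illustrative, and computing the indices themselves requires the indexability conditions the paper establishes.

```python
import numpy as np

def whittle_policy(index_table, states, budget):
    """Activate the `budget` arms with the highest Whittle index.

    index_table[k][s] holds a (precomputed) Whittle index of arm k in
    state s; this sketch only covers the per-step selection rule.
    """
    indices = np.array([index_table[k][s] for k, s in enumerate(states)])
    # Pick the `budget` arms with the largest index values.
    active = np.argsort(indices)[-budget:]
    action = np.zeros(len(states), dtype=int)
    action[active] = 1
    return action

# Toy example: 3 two-state arms, a budget of 1 activation per step.
table = [[0.2, 0.8], [0.5, 0.1], [0.3, 0.4]]
print(whittle_policy(table, states=[1, 0, 1], budget=1))  # → [1 0 0]
```

Arm 0 is chosen here because its index in state 1 (0.8) exceeds the other arms' indices in their current states (0.5 and 0.4).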
Problem

Research questions and friction points this paper is trying to address.

Incorporating risk-awareness into restless multi-arm bandits.
Establishing indexability conditions for risk-aware objectives.
Proposing a Thompson sampling approach for unknown transitions.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Risk-aware objective in restless bandits
Whittle index solution for indexability
Thompson sampling for unknown transitions
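For the learning setting with unknown transitions, a standard Thompson sampling construction maintains a Dirichlet posterior over each (state, action) row of an arm's transition kernel, samples a plausible kernel at the start of each episode, plans against the sample, and updates the posterior with observed transitions. The sketch below shows that posterior bookkeeping under these generic assumptions; the class name `DirichletModel` is illustrative, and the risk-aware planning step from the paper is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

class DirichletModel:
    """Posterior over one arm's transition kernel: one Dirichlet
    distribution per (action, state) row."""

    def __init__(self, n_states, n_actions, prior=1.0):
        # counts[a, s, s'] = prior pseudo-counts + observed transitions.
        self.counts = np.full((n_actions, n_states, n_states), prior)

    def sample(self):
        # Draw a plausible transition kernel from the posterior.
        return np.array([[rng.dirichlet(row) for row in act]
                         for act in self.counts])

    def update(self, s, a, s_next):
        # Record one observed transition (s, a) -> s_next.
        self.counts[a, s, s_next] += 1.0

# Episode skeleton: sample a model, plan under it (e.g. compute Whittle
# indices for the sampled kernel -- omitted), act, then update counts.
model = DirichletModel(n_states=2, n_actions=2)
kernel = model.sample()   # shape (actions, states, states); rows sum to 1
model.update(s=0, a=1, s_next=1)
```

Resampling the kernel only once per episode, rather than every step, is what yields the episodic regret bound stated in the abstract.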
N. Akbarzadeh
HEC Montréal, Canada; GERAD, Canada; MILA - Quebec AI Institute, Canada
Erick Delage
Professor, Department of Decision Sciences, HEC Montréal
Decision making under uncertainty, robust optimization, stochastic programming, applied statistics
Y. Adulyasak
HEC Montréal, Canada; GERAD, Canada