MARBLE: Multi-Armed Restless Bandits in Latent Markovian Environment

📅 2025-11-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address a key limitation of traditional Restless Multi-Armed Bandits (RMABs), namely the static-environment assumption that prevents them from modeling non-stationary dynamics, this paper proposes MARBLE, a framework in which a hidden Markov state drives latent environmental shifts, so that each arm's dynamics depend on an unobserved, time-varying environment state. Methodologically, the authors introduce Markov-Averaged Indexability (MAI), a relaxed indexability criterion, and prove for the first time that synchronous Q-learning with Whittle indices (QWI) converges almost surely to the optimal policy in hidden-state-driven non-stationary RMABs, extending the applicability of Whittle indices beyond stationary regimes. Empirically, the QWI algorithm is evaluated in a digital twin-embedded recommender-system simulator, where it adapts rapidly to latent state changes and learns policies efficiently, demonstrating the improved modeling fidelity and practical utility of RMABs in realistic dynamic environments.

📝 Abstract
Restless Multi-Armed Bandits (RMABs) are powerful models for decision-making under uncertainty, yet classical formulations typically assume fixed dynamics, an assumption often violated in nonstationary environments. We introduce MARBLE (Multi-Armed Restless Bandits in a Latent Markovian Environment), which augments RMABs with a latent Markov state that induces nonstationary behavior. In MARBLE, each arm evolves according to a latent environment state that switches over time, making policy learning substantially more challenging. We further introduce the Markov-Averaged Indexability (MAI) criterion as a relaxed indexability assumption and prove that, despite unobserved regime switches, under the MAI criterion, synchronous Q-learning with Whittle Indices (QWI) converges almost surely to the optimal Q-function and the corresponding Whittle indices. We validate MARBLE on a calibrated simulator-embedded (digital twin) recommender system, where QWI consistently adapts to a shifting latent state and converges to an optimal policy, empirically corroborating our theoretical findings.
Problem

Research questions and friction points this paper is trying to address.

Modeling restless bandits with latent Markovian environmental state transitions
Addressing nonstationary dynamics in multi-armed bandit decision-making problems
Developing convergent algorithms for unobserved regime switching environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Augments RMABs with a latent Markov environment state that induces nonstationary arm dynamics
Introduces the Markov-Averaged Indexability (MAI) criterion as a relaxed indexability assumption
Proves almost-sure convergence of synchronous Q-learning with Whittle Indices (QWI) under unobserved regime switches
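The Q-learning-with-Whittle-indices scheme the paper builds on can be illustrated with a minimal sketch. Everything here is hypothetical: a toy two-state arm with made-up transition matrices, not the paper's recommender dynamics, and the latent environment switching that defines MARBLE is omitted for brevity. The sketch shows only the standard two-timescale coupling: a fast Q-update under a per-state subsidy for staying passive, and a slow index update that drives the passive and active Q-values toward equality, at which point the subsidy estimates the Whittle index.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state arm. P[a] is the transition matrix under
# action a (0 = passive, 1 = active); reward depends on state only.
P = {
    0: np.array([[0.9, 0.1], [0.3, 0.7]]),  # passive: drifts toward state 0
    1: np.array([[0.4, 0.6], [0.1, 0.9]]),  # active: drifts toward state 1
}
R = np.array([0.0, 1.0])

n_states = 2
Q = np.zeros((n_states, 2))   # Q[s, a] under the current subsidy lam[s]
lam = np.zeros(n_states)      # per-state Whittle index estimate
gamma = 0.9
alpha, beta = 0.1, 0.01       # two timescales: Q fast, index slow

s = 0
for t in range(50_000):
    a = int(rng.integers(2))                 # uniform exploration
    s_next = int(rng.choice(n_states, p=P[a][s]))
    # The subsidy lam[s] is paid only for remaining passive.
    r = R[s] + (lam[s] if a == 0 else 0.0)
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    # Slower update: raise lam[s] while acting still beats being passive,
    # so that Q[s, 1] and Q[s, 0] equalize at the index value.
    lam[s] += beta * (Q[s, 1] - Q[s, 0])
    s = s_next
```

In this toy instance acting pushes the arm toward the rewarding state, so both index estimates settle at positive values. MARBLE's contribution is showing that a synchronous version of this scheme still converges almost surely when the transition matrices themselves switch with an unobserved Markov environment state, provided the MAI criterion holds.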