🤖 AI Summary
This work addresses the resource allocation problem in dynamic wireless networks under heterogeneous individual penalty constraints—such as energy consumption, activation frequency, or minimum information freshness—and proposes a novel Penalty-Optimal Whittle (POW) index policy. Built upon Whittle index theory, the POW index depends solely on each user’s own state transition kernel and constraints, requires no online parameter tuning, and incurs computational complexity independent of system scale. It is the first scalable index policy to achieve near-optimal performance while strictly satisfying individual penalty constraints, with provable asymptotic optimality in large-scale systems. By integrating deep reinforcement learning, POW enables efficient offline precomputation and seamless online deployment, significantly outperforming existing methods across diverse scenarios while rigorously adhering to all individual constraints.
📝 Abstract
This paper investigates the Restless Multi-Armed Bandit (RMAB) framework under individual penalty constraints to address resource allocation challenges in dynamic wireless networked environments. Unlike conventional RMAB models, our model allows each user (arm) to have distinct and stringent performance constraints, such as energy limits, activation limits, or minimum Age of Information requirements, enabling the capture of diverse objectives including fairness and efficiency. To find the optimal resource allocation policy, we propose a new Penalty-Optimal Whittle (POW) index policy. The POW index of a user depends only on that user's transition kernel and penalty constraints, and is invariant to system-wide features such as the number of users present and the amount of available resources. This makes it computationally tractable to calculate the POW indices offline without any need for online adaptation. Moreover, we theoretically prove that the POW index policy is asymptotically optimal while satisfying all individual penalty constraints. We also introduce a deep reinforcement learning algorithm to efficiently learn the POW index on the fly. Simulation results across various applications and system configurations further demonstrate that the POW index policy not only achieves near-optimal performance but also significantly outperforms existing policies.
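The scheduling structure of an index policy like the one described above can be illustrated with a minimal sketch: each user carries a per-state index table computed offline, and in every slot the scheduler activates the users with the largest current indices. The index values, state spaces, and tie-breaking rule below are hypothetical placeholders for illustration, not the paper's actual POW computation, which is derived from each user's transition kernel and penalty constraints.

```python
def top_m_by_index(index_tables, states, m):
    """Activate the m users with the largest precomputed index for
    their current state (ties broken by lower user id).

    index_tables: per-user dict mapping state -> index value
                  (hypothetical values; a real POW table would be
                  derived offline from the user's transition kernel
                  and penalty constraints).
    states:       current state of each user.
    """
    scored = [(index_tables[u][s], u) for u, s in enumerate(states)]
    scored.sort(key=lambda t: (-t[0], t[1]))  # highest index first
    return sorted(u for _, u in scored[:m])

# Example: 4 users, 2 states each, 2 resource units per slot.
index_tables = [
    {0: 0.1, 1: 0.9},  # user 0
    {0: 0.5, 1: 0.7},  # user 1
    {0: 0.3, 1: 0.8},  # user 2
    {0: 0.2, 1: 0.6},  # user 3
]
states = [1, 0, 1, 0]
print(top_m_by_index(index_tables, states, 2))  # → [0, 2]
```

Because the table lookup and a partial sort are the only per-slot operations, the online cost of such a policy grows only with the number of users to rank, which is consistent with the offline-precomputation, no-online-tuning property the abstract highlights.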