🤖 AI Summary
This paper addresses the joint optimization of cumulative reward maximization, fairness (i.e., lower bounds on per-arm average rewards), and regularity (i.e., lower bounds on per-arm reward frequencies) in combinatorial multi-armed bandits, motivated by critical applications such as wireless resource scheduling. We propose a parameterized algorithm integrating virtual queues, time-since-last-reward (TSLR) metrics, and upper confidence bound (UCB) principles, and analyze it rigorously via Lyapunov drift theory. To the best of our knowledge, this is the first work to simultaneously achieve zero cumulative fairness violation, strict regularity guarantees, and $O(\sqrt{T})$ sublinear regret, establishing a novel theoretical trade-off frontier. Extensive experiments on two real-world wireless network datasets validate both the efficacy and practical applicability of the proposed approach.
📝 Abstract
The combinatorial multi-armed bandit model is designed to maximize cumulative rewards in the presence of uncertainty by activating a subset of arms in each round. This paper is inspired by two critical applications in wireless networks, where it is essential not only to maximize cumulative rewards but also to guarantee fairness among arms (i.e., the minimum average reward required by each arm) and to ensure reward regularity (i.e., how often each arm receives a reward). In this paper, we propose a parameterized regular and fair learning algorithm to achieve these three objectives. In particular, the proposed algorithm linearly combines virtual queue-lengths (tracking the fairness violations), Time-Since-Last-Reward (TSLR) metrics, and Upper Confidence Bound (UCB) estimates in its weight measure. Here, TSLR is similar to age-of-information and measures the number of rounds elapsed since an arm last received a reward, capturing reward regularity, while UCB estimates are used to balance the tradeoff between exploration and exploitation in online learning. By exploiting a key relationship between virtual queue-lengths and TSLR metrics and utilizing several non-trivial Lyapunov functions, we analytically characterize the zero cumulative fairness violation, reward regularity, and cumulative regret performance of our proposed algorithm. These theoretical outcomes are verified by simulations based on two real-world datasets.
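The weight measure described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's exact algorithm: the combination coefficients `gamma` and `beta`, the queue update rule, and the specific UCB bonus form are all assumptions made for concreteness.

```python
import math

def update_queue(q, demand, reward):
    # Virtual queue tracking an arm's fairness violation (assumed rule):
    # grows by the arm's required average reward and drains by the
    # realized reward, never going below zero.
    return max(q + demand - reward, 0.0)

def select_arms(queues, tslr, means, counts, t, K, gamma=1.0, beta=1.0):
    """Activate the K arms with the largest combined weight.

    weight_i = Q_i + gamma * TSLR_i + beta * UCB_i,
    a hypothetical linear combination in the spirit of the paper's
    weight measure (the exact form is not specified in the abstract).
    """
    n = len(means)
    weights = []
    for i in range(n):
        # Standard UCB estimate: empirical mean plus exploration bonus.
        bonus = math.sqrt(2 * math.log(max(t, 2)) / max(counts[i], 1))
        ucb = means[i] + bonus
        weights.append(queues[i] + gamma * tslr[i] + beta * ucb)
    # Arms with large fairness backlogs or long reward droughts are
    # prioritized even over arms with higher estimated mean reward.
    return sorted(range(n), key=lambda i: weights[i], reverse=True)[:K]
```

Note how an arm with a large virtual queue (arm 2) or a long time since its last reward (arm 3) can outrank the arm with the highest UCB estimate, which is how the algorithm trades some immediate reward for fairness and regularity.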