🤖 AI Summary
This paper studies a multiple-play stochastic bandit problem with prioritized capacity sharing: each of $M$ arms has a stochastic capacity, and $K$ plays (jobs), each with its own priority weight, compete for that limited capacity, with higher-priority plays served first. The problem is motivated by resource allocation in LLM applications and edge intelligence. The authors formulate the priority-weighted nonlinear combinatorial utility function induced by this sharing mechanism; prove instance-independent and instance-dependent regret lower bounds that depend on the reward tail parameter $\sigma$; design an offline subroutine, \texttt{MSB-PRS-OffOpt}, that finds the optimal play allocation in $O(MK^3)$ time when the model parameters are known; and build an approximate Upper Confidence Bound (UCB) online algorithm on top of it. The online algorithm's instance-independent and instance-dependent regret upper bounds match the corresponding lower bounds up to factors of $\sqrt{K \ln KT}$ and $\alpha_1 K^2$, respectively, where $\alpha_1$ is the largest priority weight.
📝 Abstract
This paper proposes a variant of multiple-play stochastic bandits tailored to resource allocation problems arising in LLM applications, edge intelligence, and similar settings. The model consists of $M$ arms and $K$ plays. Each arm has a stochastic number of capacity units, and each unit of capacity is associated with a reward function. Each play is associated with a priority weight. When multiple plays compete for an arm's capacity, the capacity is allocated to plays in decreasing order of priority weight. We prove instance-independent and instance-dependent regret lower bounds of $\Omega(\alpha_1 \sigma \sqrt{KMT})$ and $\Omega(\alpha_1 \sigma^2 \frac{M}{\Delta} \ln T)$, respectively, where $\alpha_1$ is the largest priority weight and $\sigma$ characterizes the reward tail. When the model parameters are given, we design an algorithm named \texttt{MSB-PRS-OffOpt} that locates the optimal play allocation policy with computational complexity $O(MK^3)$. Using \texttt{MSB-PRS-OffOpt} as a subroutine, we design an approximate upper confidence bound (UCB) based algorithm whose instance-independent and instance-dependent regret upper bounds match the corresponding lower bounds up to factors of $\sqrt{K \ln KT}$ and $\alpha_1 K^2$, respectively. To this end, we address nontrivial technical challenges arising from optimizing and learning under a special nonlinear combinatorial utility function induced by the prioritized resource sharing mechanism.
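The priority-first sharing rule can be illustrated with a minimal Python sketch of the utility collected in a single round. It is not the paper's \texttt{MSB-PRS-OffOpt} algorithm. The function name `prioritized_utility` and its parameters are hypothetical, and the sketch assumes that a play which obtains a capacity unit earns its priority weight times that arm's realized per-unit reward, with all units on an arm sharing the same realized reward.

```python
from typing import Dict, List, Sequence


def prioritized_utility(
    assignment: Sequence[int],
    weights: Sequence[float],
    capacities: Sequence[int],
    unit_rewards: Sequence[float],
) -> float:
    """Total weighted reward of one round under priority-first sharing.

    assignment[k]   : arm index that play k is assigned to
    weights[k]      : priority weight of play k
    capacities[m]   : realized (non-negative) number of capacity units on arm m
    unit_rewards[m] : realized reward per capacity unit on arm m
                      (assumed identical across units; illustrative only)
    """
    # Group plays by the arm they were assigned to.
    plays_on_arm: Dict[int, List[int]] = {}
    for play, arm in enumerate(assignment):
        plays_on_arm.setdefault(arm, []).append(play)

    total = 0.0
    for arm, plays in plays_on_arm.items():
        # Plays with larger priority weight are served first.
        plays.sort(key=lambda k: weights[k], reverse=True)
        # Only as many plays as there are capacity units get served.
        served = plays[: capacities[arm]]
        total += sum(weights[k] * unit_rewards[arm] for k in served)
    return total
```

With weights `[3.0, 2.0, 1.0]`, capacities `[2, 1]`, and unit rewards `[1.0, 0.5]`, assigning all three plays to arm 0 yields 5.0 because the lowest-priority play is left unserved, whereas moving that play to arm 1 yields 5.5. The utility is therefore not a sum of independent per-play terms, which is the nonlinearity the abstract refers to.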