Multiple-play Stochastic Bandits with Prioritized Arm Capacity Sharing

📅 2025-12-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper studies a multiple-play stochastic bandit problem with prioritized capacity sharing: each of $M$ arms has a stochastic capacity, and $K$ plays with distinct priority weights compete for that capacity, which is allocated to larger priority weights first. The problem is motivated by resource allocation in LLM applications and edge intelligence. The authors prove instance-independent and instance-dependent regret lower bounds that depend on the largest priority weight $\alpha_1$ and a reward-tail parameter $\sigma$; design an offline algorithm, MSB-PRS-OffOpt, that finds the optimal play allocation in $O(MK^3)$ time when the model parameters are known; and build an approximate Upper Confidence Bound (UCB) online algorithm that uses it as a subroutine. The online algorithm's regret upper bounds match the instance-independent and instance-dependent lower bounds up to factors of $\sqrt{K \ln KT}$ and $\alpha_1 K^2$, respectively. The main technical difficulty is optimizing and learning under the nonlinear combinatorial utility function induced by prioritized sharing.

📝 Abstract

This paper proposes a variant of multiple-play stochastic bandits tailored to resource allocation problems arising from LLM applications, edge intelligence, etc. The model is composed of $M$ arms and $K$ plays. Each arm has a stochastic number of capacities, and each unit of capacity is associated with a reward function. Each play is associated with a priority weight. When multiple plays compete for the arm capacity, the arm capacity is allocated in a larger priority weight first manner. Instance independent and instance dependent regret lower bounds of $\Omega(\alpha_1 \sigma \sqrt{KMT})$ and $\Omega(\alpha_1 \sigma^2 \frac{M}{\Delta} \ln T)$ are proved, where $\alpha_1$ is the largest priority weight and $\sigma$ characterizes the reward tail. When model parameters are given, we design an algorithm named \texttt{MSB-PRS-OffOpt} to locate the optimal play allocation policy with a computational complexity of $O(MK^3)$. Utilizing \texttt{MSB-PRS-OffOpt} as a subroutine, an approximate upper confidence bound (UCB) based algorithm is designed, which has instance independent and instance dependent regret upper bounds matching the corresponding lower bound up to factors of $\sqrt{K \ln KT}$ and $\alpha_1 K^2$ respectively. To this end, we address nontrivial technical challenges arising from optimizing and learning under a special nonlinear combinatorial utility function induced by the prioritized resource sharing mechanism.
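The "larger priority weight first" rule in the abstract can be illustrated with a small simulation of one round. This is a minimal sketch under assumptions that the abstract does not state: each play consumes exactly one unit of capacity, every unit on an arm yields the same realized reward, and a served play contributes its weight times that reward. The function name `allocate_round` and all of its parameters are hypothetical.

```python
from collections import defaultdict


def allocate_round(assignment, weights, capacities, unit_rewards):
    """Serve the plays on each arm in descending priority weight.

    assignment[k]   : arm index chosen by play k
    weights[k]      : priority weight of play k
    capacities[m]   : realized capacity (in units) of arm m this round
    unit_rewards[m] : realized per-unit reward of arm m this round

    Assumption (not from the paper): one play consumes one unit of capacity.
    """
    plays_on_arm = defaultdict(list)
    for play, arm in enumerate(assignment):
        plays_on_arm[arm].append(play)

    served = []
    utility = 0.0
    for arm, plays in plays_on_arm.items():
        # Larger priority weight first; ties are broken by play index.
        plays.sort(key=lambda k: (-weights[k], k))
        for play in plays[: capacities[arm]]:
            served.append(play)
            utility += weights[play] * unit_rewards[arm]
    return sorted(served), utility


# Three plays compete for two units on arm 0; the lowest-weight play is left out.
served, utility = allocate_round(
    assignment=[0, 0, 0, 1],
    weights=[3.0, 1.0, 2.0, 1.0],
    capacities=[2, 1],
    unit_rewards=[0.5, 1.0],
)
```

In this toy round, play 1 has the smallest weight on arm 0 and receives no capacity, which is the competition effect the model is built around.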

Problem

Research questions and friction points this paper is trying to address.

Develops a multi-play bandit model with prioritized capacity sharing for resource allocation

Proves regret lower bounds and designs algorithms for optimal play allocation

Addresses technical challenges in learning under nonlinear combinatorial utility functions
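To see why the utility is nonlinear and combinatorial, consider a brute-force baseline for tiny instances. The sketch below is not the paper's MSB-PRS-OffOpt algorithm; it simply enumerates all $M^K$ assignments. It assumes, purely for illustration, that each play needs one unit of capacity and that an arm's capacity is independent of its per-unit reward, so the $j$-th highest-priority play on an arm is served with probability $P(\text{capacity} \ge j)$. All function and parameter names are hypothetical.

```python
from itertools import product


def expected_arm_utility(sorted_weights, capacity_pmf, mean_reward):
    """Expected utility of one arm for play weights sorted in descending order.

    capacity_pmf[d] = P(capacity == d). The j-th highest-priority play
    (1-based) is served only when the realized capacity is at least j.
    """
    total = 0.0
    for j, weight in enumerate(sorted_weights, start=1):
        prob_served = sum(capacity_pmf[j:])  # P(capacity >= j)
        total += weight * prob_served
    return mean_reward * total


def brute_force_optimum(weights, capacity_pmfs, mean_rewards):
    """Enumerate all M**K assignments; only feasible for very small M and K."""
    num_arms = len(mean_rewards)
    best_value, best_assignment = float("-inf"), None
    for assignment in product(range(num_arms), repeat=len(weights)):
        value = 0.0
        for arm in range(num_arms):
            arm_weights = sorted(
                (weights[k] for k, a in enumerate(assignment) if a == arm),
                reverse=True,
            )
            value += expected_arm_utility(
                arm_weights, capacity_pmfs[arm], mean_rewards[arm]
            )
        if value > best_value:
            best_value, best_assignment = value, assignment
    return best_assignment, best_value


# Arm 0 always has one unit; arm 1 has one unit with probability 0.5.
best_assignment, best_value = brute_force_optimum(
    weights=[2.0, 1.0],
    capacity_pmfs=[[0.0, 1.0, 0.0], [0.5, 0.5]],
    mean_rewards=[1.0, 0.8],
)
```

Adding a second play to arm 0 contributes nothing here because its single unit is already taken by the higher-priority play, so the value of an assignment is not a sum of independent per-play terms.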

Innovation

Methods, ideas, or system contributions that make the work stand out.

Prioritized capacity sharing with weighted plays

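Offline algorithm (MSB-PRS-OffOpt) that finds the optimal play allocation in $O(MK^3)$ time

Approximate UCB algorithm whose regret upper bounds match the lower bounds up to factors of $\sqrt{K \ln KT}$ and $\alpha_1 K^2$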
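The general shape of a UCB-based approach is to replace unknown means with optimistic estimates and then hand those estimates to an offline solver. The sketch below uses the textbook UCB1 confidence radius, which is an assumption: the summary does not give the paper's exact index or how it treats capacity estimates. The `scale` parameter is hypothetical and stands in for any dependence on the reward-tail parameter $\sigma$.

```python
import math


def ucb_index(empirical_mean, num_pulls, round_t, scale=1.0):
    """Textbook UCB1-style optimistic estimate of an arm's mean reward.

    Returns +inf for an arm with no observations so that it gets explored.
    The radius sqrt(2 ln t / n) is the standard UCB1 choice, not necessarily
    the one used in the paper.
    """
    if num_pulls == 0:
        return float("inf")
    radius = scale * math.sqrt(2.0 * math.log(round_t) / num_pulls)
    return empirical_mean + radius


# Optimistic estimates for three arms at round t = 100.
empirical_means = [0.40, 0.55, 0.50]
pull_counts = [30, 50, 5]
indices = [ucb_index(m, n, 100) for m, n in zip(empirical_means, pull_counts)]
```

The rarely observed third arm receives the largest bonus, so an offline allocation computed from `indices` would tend to route plays toward it until its estimate tightens.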

🔎 Similar Papers

Multi-Player Approaches for Dueling Bandits