AI Summary
This work addresses the trade-off between latency service-level objectives (SLOs) and spare-capacity cost in multi-tenant GPU cloud platforms, where workload demand is endogenous because pricing shapes tenant submissions. The authors formulate joint pricing and scaling as a mean-field Stackelberg game and derive an explicit equilibrium demand map. The closed-loop model exposes a structural failure mode: delay-insensitive workloads sustain a residual demand floor that makes the backlog undrainable under bounded price and service capacity. In response, they propose a computable drainability guardrail that certifies uniformly negative drift in the residual-demand regime, together with an optimizer-agnostic action shield for the full dynamic problem. Theoretically, they prove that any fixed price-capacity pair satisfying the guardrail admits a unique operating point, with global convergence to it under a checkable step-size condition. Experiments show that the shield improves the safety and robustness of model-free reinforcement learning (RL) policies in this setting.
Abstract
Modern Graphics Processing Unit (GPU)-backed services must satisfy strict latency service-level objectives (SLOs) while controlling spare-capacity cost. In multi-tenant GPU cloud platforms, this trade-off is inherently dynamic because workload demand is endogenous: pricing shapes the submissions of heterogeneous tenants, which in turn affect congestion and delay. We formulate the joint pricing-and-scaling problem as a large-population Stackelberg game and derive an explicit equilibrium demand map. The resulting closed-loop model reveals a structural failure mode in which delay-insensitive workloads sustain a residual demand floor, making the backlog undrainable under bounded price and service capacity. This observation motivates a computable drainability guardrail that certifies uniformly negative drift in the residual-demand regime. For any fixed price-capacity pair satisfying the drainability guardrail, we establish a unique operating point and global convergence towards it under a checkable step-size condition. Building on this fixed-pair analysis, we further develop an optimizer-agnostic action shield for the full dynamic problem and show empirically that it improves safety and robustness for model-free reinforcement learning (RL) in this setting.
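To make the guardrail-plus-shield idea concrete, the following is a minimal sketch and not the paper's actual construction. It assumes a hypothetical linear residual demand floor (`residual_demand_floor`), a hypothetical drift margin, and a simple drift test of the form "floor minus capacity is at most minus margin". The abstract does not give the real demand map or guardrail condition, so every function name, parameter, and number here is illustrative.

```python
from typing import Callable, Iterable, Tuple

# An action is a (price, capacity) pair chosen by the platform.
Action = Tuple[float, float]


def residual_demand_floor(price: float,
                          base_rate: float = 4.0,
                          sensitivity: float = 0.5) -> float:
    """Hypothetical arrival rate of delay-insensitive work at a given price.

    Assumption: a linear, price-only demand floor. The paper's equilibrium
    demand map is not specified in the abstract.
    """
    return max(0.0, base_rate - sensitivity * price)


def is_drainable(price: float, capacity: float, margin: float = 0.5) -> bool:
    """Illustrative guardrail: backlog drift in the residual-demand regime
    must be at most -margin, i.e. capacity exceeds the floor by the margin."""
    drift = residual_demand_floor(price) - capacity
    return drift <= -margin


def shield(candidates: Iterable[Action],
           score: Callable[[Action], float],
           fallback: Action) -> Action:
    """Optimizer-agnostic action mask: drop candidates that fail the
    guardrail, then return the best remaining one under any scoring rule."""
    safe = [a for a in candidates if is_drainable(*a)]
    if not safe:
        return fallback  # assumed known-safe default action
    return max(safe, key=score)


if __name__ == "__main__":
    candidates = [(1.0, 3.0), (2.0, 3.5), (4.0, 3.0)]
    # Toy objective: prefer low capacity, mildly prefer low price.
    score = lambda a: -a[1] - 0.1 * a[0]
    print("unshielded:", max(candidates, key=score))
    print("shielded:  ", shield(candidates, score, fallback=(4.0, 4.0)))
```

Because the mask only filters the candidate set, the `score` function can come from any optimizer, including a learned RL policy's action values. That is the sense in which such a shield is optimizer-agnostic.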