🤖 AI Summary
To differentiate SLO guarantees among multi-priority clients in interactive LLM serving, this paper formalizes multi-priority request scheduling as a service gain maximization problem and proposes PROSERVE, a unified two-tier scheduling framework. At the engine level, SlideBatching adapts batch formation and request ordering to the current load, using a sliding boundary to balance deadline-first and density-first strategies. At the service level, GoRouting performs gain-oriented, capability-aware dispatching across distributed instances and reserves capacity for future high-priority or long requests. Experiments on four open-source datasets and a real-world industrial trace show up to 35% higher service gain and up to 52% higher SLO attainment than state-of-the-art baselines.
📝 Abstract
The widespread deployment of large language models (LLMs) for interactive applications necessitates serving systems that can handle thousands of concurrent requests with diverse Service Level Objective (SLO) requirements. A critical yet often overlooked dimension in this context is the inherent priority difference among clients; for instance, business-critical functions demand higher performance guarantees, as fulfilling such requests yields significantly greater business value. However, existing LLM serving schedulers fail to jointly optimize for both SLO attainment and client-level priorities.
To bridge this gap, we first formalize multi-priority request scheduling as a service gain maximization problem, where satisfying latency requirements for requests of different priorities contributes varying levels of gain. We then propose PROSERVE, a unified two-tier scheduling framework designed to maximize overall service gain. At the engine level, SlideBatching dynamically adapts batch formation and request ordering under varying load conditions, employing a sliding boundary mechanism to balance deadline-first and density-first strategies. At the service level, GoRouting performs gain-oriented and capability-aware dispatching across distributed instances, proactively reserving capacity for future high-priority or long requests. Extensive evaluation across four open-source datasets and a real-world industrial trace demonstrates that PROSERVE consistently outperforms state-of-the-art baselines, improving system gain by up to 35% and boosting SLO attainment by up to 52%.
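To make the trade-off between deadline-first and density-first ordering concrete, here is a minimal Python sketch. It is not the paper's algorithm: the `Request` fields, the `order_requests` helper, and the reading of the "sliding boundary" as a split point between a deadline-ordered head and a density-ordered tail are all illustrative assumptions. The abstract does not give the actual formulation.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # All fields are hypothetical, chosen only for this illustration.
    rid: str
    gain: float      # value earned if the request meets its deadline
    deadline: float  # absolute deadline, in seconds
    cost: float      # estimated service time, in seconds


def density(r: Request) -> float:
    """Gain per unit of service time."""
    return r.gain / r.cost


def order_requests(reqs, boundary):
    """Assumed scheme: the first `boundary` slots go to the earliest
    deadlines; the remaining requests are ordered by gain density."""
    by_deadline = sorted(reqs, key=lambda r: r.deadline)
    head = by_deadline[:boundary]
    tail = sorted(by_deadline[boundary:], key=density, reverse=True)
    return head + tail


def service_gain(order, now=0.0):
    """Total gain of requests that finish by their deadline when
    served one after another in the given order."""
    t, total = now, 0.0
    for r in order:
        t += r.cost
        if t <= r.deadline:
            total += r.gain
    return total


# An overloaded toy workload: not every deadline can be met.
reqs = [
    Request("a", gain=1.0, deadline=2.0, cost=2.0),
    Request("b", gain=10.0, deadline=2.5, cost=1.0),
    Request("c", gain=4.0, deadline=3.0, cost=1.0),
]

deadline_first = service_gain(order_requests(reqs, boundary=len(reqs)))
density_first = service_gain(order_requests(reqs, boundary=0))
print(deadline_first, density_first)
```

In this toy overload case, pure deadline-first ordering spends its time on the low-gain request and misses the other two, while pure density-first ordering serves the two high-gain requests in time. Moving the boundary between these extremes is one plausible way to adapt the ordering to load, which is the intuition the abstract attributes to SlideBatching.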