PROSERVE: Unified Multi-Priority Request Scheduling for LLM Serving

📅 2025-12-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To differentiate SLO guarantees for multi-priority clients in interactive LLM serving, this paper formalizes multi-priority request scheduling as a service gain maximization problem and proposes PROSERVE, a unified two-tier scheduling framework. At the engine level, SlideBatching adapts batch formation and request ordering to the current load, using a sliding boundary to balance deadline-first and density-first strategies. At the service level, GoRouting performs gain-oriented, capability-aware dispatching across distributed instances and proactively reserves capacity for future high-priority or long requests. Experiments on four open-source datasets and a real-world industrial trace show up to 35% higher system gain and up to 52% higher SLO attainment than state-of-the-art baselines.

📝 Abstract

The widespread deployment of large language models (LLMs) for interactive applications necessitates serving systems that can handle thousands of concurrent requests with diverse Service Level Objective (SLO) requirements. A critical yet often overlooked dimension in this context is the inherent priority difference among clients; for instance, business-critical functions demand higher performance guarantees, as fulfilling such requests yields significantly greater business value. However, existing LLM serving schedulers fail to jointly optimize for both SLO attainment and client-level priorities. To bridge this gap, we first formalize multi-priority request scheduling as a service gain maximization problem, where satisfying latency requirements for requests of different priorities contributes varying levels of gain. We then propose PROSERVE, a unified two-tier scheduling framework designed to maximize overall service gain. At the engine level, SlideBatching dynamically adapts batch formation and request ordering under varying load conditions, employing a sliding boundary mechanism to balance deadline-first and density-first strategies. At the service level, GoRouting performs gain-oriented and capability-aware dispatching across distributed instances, proactively reserving capacity for future high-priority or long requests. Extensive evaluation across four open-source datasets and a real-world industrial trace demonstrates that PROSERVE consistently outperforms state-of-the-art baselines, improving system gain by up to 35% and boosting SLO attainment by up to 52%.
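The abstract distinguishes service gain from plain SLO attainment. A minimal sketch of that distinction follows, assuming each request carries a single priority weight that is earned only when its latency requirement is met. The names `Request`, `gain`, `deadline`, and `latency` are hypothetical and are not taken from the paper, whose exact formulation may differ.

```python
from dataclasses import dataclass


@dataclass
class Request:
    gain: float      # hypothetical priority weight earned if the SLO is met
    deadline: float  # latency requirement in seconds, measured from arrival
    latency: float   # observed end-to-end latency in seconds


def service_gain(requests):
    """Sum the weights of the requests that met their deadline."""
    return sum(r.gain for r in requests if r.latency <= r.deadline)


def slo_attainment(requests):
    """Fraction of requests that met their deadline, ignoring priority."""
    if not requests:
        return 0.0
    return sum(r.latency <= r.deadline for r in requests) / len(requests)


reqs = [
    Request(gain=10.0, deadline=1.0, latency=0.8),  # high priority, met
    Request(gain=1.0, deadline=2.0, latency=2.5),   # low priority, missed
    Request(gain=1.0, deadline=2.0, latency=1.5),   # low priority, met
]
total_gain = service_gain(reqs)
attainment = slo_attainment(reqs)
```

Under this assumed model, missing the high-priority request instead of a low-priority one would leave attainment unchanged but cut the gain sharply, which is why the two metrics can call for different schedules.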

Problem

Research questions and friction points this paper is trying to address.

Schedules LLM requests with different client priorities

Maximizes service gain by meeting varied latency requirements

Balances batch formation and request ordering under load
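One way to picture the balance between deadline-first and density-first ordering is sketched below. It is an illustration under stated assumptions, not the SlideBatching algorithm from the paper: the `load` signal in [0, 1], the linear boundary rule, and the definition of gain density as `gain / cost` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Pending:
    rid: str
    gain: float      # hypothetical priority weight
    deadline: float  # absolute deadline in seconds
    cost: float      # estimated service time in seconds, must be > 0


def order_queue(queue, load):
    """Order pending requests around a load-dependent boundary.

    load is an assumed signal in [0, 1]: 0 means idle, 1 means saturated.
    Requests before the boundary are served deadline-first; the rest are
    served by gain density (gain per unit of estimated cost).
    """
    load = min(max(load, 0.0), 1.0)
    by_deadline = sorted(queue, key=lambda r: r.deadline)
    # The boundary slides toward the front of the queue as load grows.
    boundary = int(len(by_deadline) * (1.0 - load))
    urgent = by_deadline[:boundary]
    dense = sorted(by_deadline[boundary:],
                   key=lambda r: r.gain / r.cost, reverse=True)
    return urgent + dense


queue = [
    Pending("a", gain=1.0, deadline=1.0, cost=1.0),
    Pending("b", gain=14.0, deadline=4.0, cost=2.0),
    Pending("c", gain=2.0, deadline=2.0, cost=1.0),
    Pending("d", gain=6.0, deadline=3.0, cost=1.0),
]
idle_order = [r.rid for r in order_queue(queue, load=0.0)]
half_order = [r.rid for r in order_queue(queue, load=0.5)]
full_order = [r.rid for r in order_queue(queue, load=1.0)]
```

With an idle system the sketch degenerates to earliest-deadline-first; at saturation it serves the requests with the highest gain per unit of work first, on the assumption that not every deadline can be met.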

Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-tier scheduling framework maximizes service gain

SlideBatching adapts batch formation with sliding boundary

GoRouting dispatches requests with proactive capacity reservation
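Proactive capacity reservation at the routing layer can be illustrated with the sketch below. It assumes a fixed reserved fraction per instance that only high-priority requests may use, and a least-loaded tie-break. `Instance`, `route`, `reserve_fraction`, and the token-budget unit are hypothetical names for illustration; the paper's GoRouting policy is not specified on this page and may work differently.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    capacity: int  # total token budget (hypothetical unit)
    used: int = 0

    @property
    def free(self):
        return self.capacity - self.used


def route(instances, tokens, high_priority, reserve_fraction=0.2):
    """Pick an instance for a request, or return None if nothing fits.

    Low-priority requests may not consume the reserved headroom;
    high-priority requests may use all free capacity.
    """
    best = None
    for inst in instances:
        reserve = 0 if high_priority else int(inst.capacity * reserve_fraction)
        usable = inst.free - reserve
        # Among instances that fit, prefer the one with the most free capacity.
        if usable >= tokens and (best is None or inst.free > best.free):
            best = inst
    if best is not None:
        best.used += tokens
    return best


a = Instance("A", capacity=100, used=70)
b = Instance("B", capacity=100, used=85)
low_choice = route([a, b], tokens=20, high_priority=False)   # blocked by reserve
high_choice = route([a, b], tokens=20, high_priority=True)   # fits on A
next_choice = route([a, b], tokens=15, high_priority=True)   # A is now too full
```

In this example the low-priority request is turned away even though instance A has 30 free tokens, because 20 of them are held back; the high-priority request that arrives next can then be placed.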

🔎 Similar Papers

Intelligent Router for LLM Workloads: Improving Performance Through Workload-Aware Load Balancing