🤖 AI Summary
When online services and offline tasks are co-located, fluctuating request mixes cause Prefill/Decode (P/D) load imbalance that existing dynamic schedulers cannot adapt to under bursty traffic. To address this, the paper proposes a latency-constrained resource pool separation architecture. The method partitions the GPU cluster into two dedicated pools: a strict pool (guaranteeing low latency for online services) and a relaxed pool (optimized for high throughput of offline tasks). A bottleneck-aware scheduler grounded in the Roofline model enables fine-grained matching of P/D workloads across pools, and a lightweight preemption mechanism ensures millisecond-level SLO compliance for online requests. Experimental evaluation under real-world traffic demonstrates that the approach improves offline throughput by up to 3× while achieving 100% adherence to online latency SLOs.
📝 Abstract
Large Language Models (LLMs) are increasingly deployed in both latency-sensitive online services and cost-sensitive offline workloads. Co-locating these workloads on shared serving instances can improve resource utilization, but directly applying this approach to Prefill/Decode (P/D) disaggregated systems introduces severe load imbalance, as fluctuating request mixes alter the intrinsic P/D ratio. Existing dynamic adjustment techniques cannot keep up with the bursty traffic patterns of online services.
We propose a latency-constrained disaggregated architecture, which separates cluster resources into latency-strict and latency-relaxed pools based on task latency requirements. This design enables flexible placement of offline decode tasks, mitigating P/D imbalance while preserving online performance. To fully exploit this flexibility, we propose (1) a bottleneck-aware scheduler guided by a Roofline-based performance model, and (2) a fast preemption mechanism that strictly enforces Service Level Objectives (SLOs) for online requests.
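The Roofline intuition behind the scheduler can be sketched as follows. In the Roofline model, a workload whose arithmetic intensity (FLOPs per byte of memory traffic) exceeds the machine's ridge point (peak FLOP/s divided by peak memory bandwidth) is compute-bound; otherwise it is memory-bound. Prefill batches reuse each weight across many tokens (high intensity), while decode reads the full weights per generated token (low intensity). The snippet below is a minimal illustrative sketch, not the paper's implementation; the function name, pool labels, and all hardware numbers are assumptions chosen for illustration.

```python
# Hedged sketch of Roofline-based bottleneck classification.
# All names and numbers here are illustrative assumptions.

def bottleneck(flops: float, bytes_moved: float,
               peak_flops: float, peak_bw: float) -> str:
    """Classify a batch as 'compute'- or 'memory'-bound by comparing its
    arithmetic intensity to the machine's ridge point."""
    intensity = flops / bytes_moved   # FLOPs per byte of memory traffic
    ridge = peak_flops / peak_bw      # machine balance point (FLOP/byte)
    return "compute" if intensity >= ridge else "memory"

# Hypothetical A100-like hardware: 312 TFLOP/s peak compute,
# 2.0 TB/s HBM bandwidth -> ridge point ~156 FLOP/byte.
PEAK_FLOPS = 312e12
PEAK_BW = 2.0e12

# A prefill batch: ~500 FLOP/byte -> well above the ridge, compute-bound.
prefill_kind = bottleneck(1e15, 2e12, PEAK_FLOPS, PEAK_BW)

# A decode step: ~2 FLOP/byte -> far below the ridge, memory-bound.
decode_kind = bottleneck(1e12, 5e11, PEAK_FLOPS, PEAK_BW)
```

A scheduler in this style would then co-locate memory-bound decode work with compute-bound prefill work within a pool, so neither resource sits idle; the actual matching policy in the paper is more fine-grained than this binary classification.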
Experiments on real-world traces show that, compared to existing approaches, our method improves offline throughput by up to 3× while maintaining SLOs for online requests.