🤖 AI Summary
When online services and offline tasks are co-located, fluctuating request mixes cause Prefill/Decode (P/D) load imbalance that existing dynamic schedulers cannot adapt to under bursty traffic. To address this, the paper proposes a latency-constrained resource pool separation architecture. The method partitions the GPU cluster into two dedicated pools: a strict pool (guaranteeing low latency for online services) and a relaxed pool (optimized for high throughput of offline tasks). A bottleneck-aware scheduler grounded in the Roofline model enables fine-grained matching of P/D workloads across pools, and a lightweight preemption mechanism ensures millisecond-level SLO compliance for online requests. Experimental evaluation under real-world traffic demonstrates that the approach improves offline throughput by up to 3× while achieving 100% adherence to online latency SLOs.
📝 Abstract
Large Language Models (LLMs) are increasingly deployed in both latency-sensitive online services and cost-sensitive offline workloads. Co-locating these workloads on shared serving instances can improve resource utilization, but directly applying this approach to Prefill/Decode (P/D) disaggregated systems introduces severe load imbalance, as fluctuating request mixes alter the intrinsic P/D ratio. Existing dynamic adjustment techniques cannot keep up with the bursty traffic patterns of online services.
We propose a latency-constrained disaggregated architecture, which separates cluster resources into latency-strict and latency-relaxed pools based on task latency requirements. This design enables flexible placement of offline decode tasks, mitigating P/D imbalance while preserving online performance. To fully exploit this flexibility, we propose (1) a bottleneck-aware scheduler guided by a Roofline-based performance model, and (2) a fast preemption mechanism that strictly enforces Service Level Objectives (SLOs) for online requests.
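The Roofline intuition behind the scheduler can be sketched as follows. In the Roofline model, a workload whose arithmetic intensity (FLOPs per byte of memory traffic) exceeds the machine's ridge point (peak FLOP/s divided by peak memory bandwidth) is compute-bound; otherwise it is memory-bound. Prefill batches reuse each weight across many tokens (high intensity), while decode reads the full weights per generated token (low intensity). The snippet below is a minimal illustrative sketch, not the paper's implementation; the function name, pool labels, and all hardware numbers are assumptions chosen for illustration.

```python
# Hedged sketch of Roofline-based bottleneck classification.
# All names and numbers here are illustrative assumptions.

def bottleneck(flops: float, bytes_moved: float,
               peak_flops: float, peak_bw: float) -> str:
    """Classify a batch as 'compute'- or 'memory'-bound by comparing its
    arithmetic intensity to the machine's ridge point."""
    intensity = flops / bytes_moved   # FLOPs per byte of memory traffic
    ridge = peak_flops / peak_bw      # machine balance point (FLOP/byte)
    return "compute" if intensity >= ridge else "memory"

# Hypothetical A100-like hardware: 312 TFLOP/s peak compute,
# 2.0 TB/s HBM bandwidth -> ridge point ~156 FLOP/byte.
PEAK_FLOPS = 312e12
PEAK_BW = 2.0e12

# A prefill batch: ~500 FLOP/byte -> well above the ridge, compute-bound.
prefill_kind = bottleneck(1e15, 2e12, PEAK_FLOPS, PEAK_BW)

# A decode step: ~2 FLOP/byte -> far below the ridge, memory-bound.
decode_kind = bottleneck(1e12, 5e11, PEAK_FLOPS, PEAK_BW)
```

A scheduler in this style would then co-locate memory-bound decode work with compute-bound prefill work within a pool, so neither resource sits idle; the actual matching policy in the paper is more fine-grained than this binary classification.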
Experiments on real-world traces show that, compared to existing approaches, our method improves offline throughput by up to 3× while maintaining SLOs for online requests.