OOCO: Latency-disaggregated Architecture for Online-Offline Co-locate LLM Serving

📅 2025-11-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address P/D (Prefill/Decode) load imbalance caused by request fluctuations when online services and offline tasks co-locate, and the inability of existing dynamic schedulers to adapt to bursty traffic, this paper proposes a latency-constrained resource pool separation architecture. Our method partitions the GPU cluster into two dedicated pools: a strict pool (guaranteeing low latency for online services) and a relaxed pool (optimized for high throughput of offline tasks). We design a bottleneck-aware scheduler grounded in the Roofline model to enable fine-grained matching of P/D workloads across pools. Additionally, we introduce a lightweight preemption mechanism to ensure millisecond-level SLO compliance for online requests. Experimental evaluation under real-world traffic demonstrates that our approach improves offline throughput by up to 3× while achieving 100% adherence to online latency SLOs.

Technology Category

Application Category

📝 Abstract
Large Language Models (LLMs) are increasingly deployed in both latency-sensitive online services and cost-sensitive offline workloads. Co-locating these workloads on shared serving instances can improve resource utilization, but directly applying this approach to Prefill/Decode (P/D) disaggregated systems introduces severe load imbalance, as fluctuating request mixes alter the intrinsic P/D ratio. Existing dynamic adjustment techniques cannot keep up with the bursty traffic patterns of online services. We propose a latency-constraint disaggregated architecture, which separates cluster resources into latency-strict and latency-relaxed pools based on task latency requirements. This design enables flexible placement of offline decode tasks, mitigating P/D imbalance while preserving online performance. To fully exploit this flexibility, we propose (1) a bottleneck-based scheduler guided by a Roofline-based performance model for performance bottleneck based scheduling, and (2) a fast preemption mechanism that strictly enforces Service Level Objectives (SLOs) for online requests. Experiments on real-world traces show that compared to existing offline system approaches, our method improves offline throughput by up to 3x, while maintaining online request SLOs.
Problem

Research questions and friction points this paper is trying to address.

Addresses load imbalance in Prefill/Decode disaggregated LLM serving systems
Mitigates performance issues from co-locating online and offline LLM workloads
Enhances offline throughput while preserving online service latency SLOs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Latency-disaggregated architecture with strict and relaxed resource pools
Bottleneck-based scheduler using Roofline performance model
Fast preemption mechanism to enforce online request SLOs
🔎 Similar Papers
No similar papers found.
S
Siyu Wu
Beihang University
Z
Zihan Tang
Tsinghua University
Y
Yuting Zeng
University of Science and Technology of China
H
Hui Chen
Tsinghua University
Guiguang Ding
Guiguang Ding
Tsinghua University
Computer VisionMultimedia Retrieval
Tongxuan Liu
Tongxuan Liu
University of Science and Technology of China
LLM Logic ReasoningMulti-AgentsLLM Inference SystemLVLMRecommender System
K
Ke Zhang
JD Company
H
Hailong Yang
Beihang University