HyperFlexis: Joint Design of Algorithms and Systems for Multi-SLO Serving and Fast Scaling

📅 2025-08-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language model (LLM) serving faces significant scheduling and elastic scaling challenges due to highly heterogeneous request lengths, priorities, and stage-specific service-level objectives (SLOs)—notably divergent SLOs for prefill versus decode phases. This paper introduces HyperFlexis, the first unified system enabling joint multi-SLO–aware scheduling and cost-sensitive autoscaling. Its core innovations include: (1) multi-SLO–aware dynamic scheduling; (2) cross-instance KV cache migration; (3) dynamic binding of prefill/decode instances with millisecond-scale role switching; (4) budget-driven cold-start optimization; (5) direct device-to-device weight transmission; and (6) fine-grained priority partitioning. Experiments demonstrate that HyperFlexis improves SLO compliance by up to 4.44× over state-of-the-art systems, reduces P99 latency by 65.82%, significantly enhances resource utilization, and maintains equivalent cost efficiency.

📝 Abstract
Modern large language model (LLM) serving systems face challenges from highly variable requests with diverse lengths, priorities, and stage-specific service-level objectives (SLOs). Meeting these demands requires real-time scheduling, rapid and cost-effective scaling, and support for both collocated and disaggregated Prefill/Decode (P/D) architectures. We present HyperFlexis, a unified LLM serving system that integrates algorithmic and system-level innovations to jointly optimize scheduling and scaling under multiple SLOs. It features a multi-SLO-aware scheduler that leverages budget estimation and request prioritization to ensure proactive SLO compliance for both new and ongoing requests. The system supports prefill- and decode-stage multi-SLO scheduling for P/D-disaggregated architectures and KV cache transfers. It also enables cost-effective scaling decisions, prefill-decode instance linking during scaling, and rapid P/D role transitions. To accelerate scaling and reduce cold-start latency, a device-to-device (D2D) weight transfer mechanism is proposed that lowers weight-loading overhead by up to 19.39×. These optimizations allow the system to achieve up to 4.44× higher SLO attainment, 65.82% lower request latency, and cost parity with state-of-the-art baselines. The code will be released soon.
Problem

Research questions and friction points this paper is trying to address.

Optimizing scheduling and scaling for multi-SLO LLM serving
Reducing cold-start latency with efficient weight transfer
Supporting both collocated and disaggregated Prefill/Decode architectures
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-SLO-aware scheduler with budget estimation
Device-to-device weight transfer mechanism
Prefill-decode instance linking during scaling
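The paper does not publish pseudocode for its scheduler, but the "budget estimation and request prioritization" idea from the abstract can be sketched as a least-slack-first policy: estimate how much of each request's SLO budget remains after accounting for its predicted remaining service time, and serve the tightest request first. All names below, and the per-token cost constant, are illustrative assumptions, not HyperFlexis internals.

```python
from dataclasses import dataclass

# Assumed per-token service-time estimate (seconds/token); a real system
# would profile this per instance, model, and batch size.
EST_SEC_PER_TOKEN = 0.02

@dataclass
class Request:
    req_id: str
    arrival: float      # arrival timestamp (s)
    slo_latency: float  # end-to-end SLO target (s)
    tokens_left: int    # estimated tokens still to generate

def slo_budget(req: Request, now: float) -> float:
    """Remaining SLO budget minus the estimated time still needed.
    A negative value means the request is projected to miss its SLO."""
    deadline = req.arrival + req.slo_latency
    est_remaining = req.tokens_left * EST_SEC_PER_TOKEN
    return (deadline - now) - est_remaining

def pick_next(queue: list[Request], now: float) -> Request:
    """Least-slack-first: serve the request whose budget is tightest."""
    return min(queue, key=lambda r: slo_budget(r, now))

# Example: "b" has less slack than "a" despite arriving later, so it is
# scheduled first.
r_a = Request("a", arrival=99.0, slo_latency=5.0, tokens_left=100)
r_b = Request("b", arrival=98.0, slo_latency=2.5, tokens_left=10)
print(pick_next([r_a, r_b], now=100.0).req_id)  # → b
```

A scheduler like this naturally extends to the multi-SLO setting by computing separate budgets for the prefill and decode stages, which is the direction the abstract describes for P/D-disaggregated deployments.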
👥 Authors

Zahra Yousefijamarani
Huawei Technologies Canada Co., Ltd., Vancouver, Canada

Xinglu Wang
PhD student, SFU
Optimization · Multi-task learning

Qian Wang
Huawei Technologies Canada Co., Ltd., Vancouver, Canada

Morgan Lindsay Heisler
Huawei Technologies Canada Co., Ltd., Vancouver, Canada

Taha Shabani
Huawei Technologies Canada Co., Ltd., Vancouver, Canada

Niloofar Gholipour
École de technologie supérieure, Montréal, Canada

Parham Yassini
School of Computing Science, Simon Fraser University
Computer Networks · Distributed Systems · HPC Systems

Hong Chang
Researcher at Institute of Computing Technology, Chinese Academy of Sciences
Machine Learning · Computer Vision · Pattern Recognition

Kan Chen
Huawei Technologies, Ltd., China

Qiantao Zhang
Huawei Technologies, Ltd., China

Xiaolong Bai
Huawei Technologies, Ltd., China

Jiannan Wang
Simon Fraser University, Vancouver, Canada

Ying Xiong
Clausthal University of Technology
Petroleum geology · Sedimentology · Geochemistry

Yong Zhang
Huawei Technologies Canada Co., Ltd., Vancouver, Canada

Zhenan Fan
Staff Researcher at Huawei Technologies Canada
Optimization · Large Language Model