🤖 AI Summary
Large language model (LLM) serving faces significant scheduling and elastic scaling challenges due to highly heterogeneous request lengths, priorities, and stage-specific service-level objectives (SLOs)—notably divergent SLOs for prefill versus decode phases. This paper introduces HyperFlexis, the first unified system enabling joint multi-SLO–aware scheduling and cost-sensitive autoscaling. Its core innovations include: (1) multi-SLO–aware dynamic scheduling; (2) cross-instance KV cache migration; (3) dynamic binding of prefill/decode instances with millisecond-scale role switching; (4) budget-driven cold-start optimization; (5) direct device-to-device weight transmission; and (6) fine-grained priority partitioning. Experiments demonstrate that HyperFlexis improves SLO compliance by up to 4.44× over state-of-the-art systems, reduces P99 latency by 65.82%, significantly enhances resource utilization, and maintains equivalent cost efficiency.
📝 Abstract
Modern large language model (LLM) serving systems face challenges from highly variable requests with diverse lengths, priorities, and stage-specific service-level objectives (SLOs). Meeting these demands requires real-time scheduling, rapid and cost-effective scaling, and support for both collocated and disaggregated Prefill/Decode (P/D) architectures.
We present **HyperFlexis**, a unified LLM serving system that integrates algorithmic and system-level innovations to jointly optimize scheduling and scaling under multiple SLOs. It features a multi-SLO-aware scheduler that leverages budget estimation and request prioritization to ensure proactive SLO compliance for both new and ongoing requests. The system supports prefill- and decode-stage multi-SLO scheduling for P/D-disaggregated architectures and KV cache transfers. It also enables cost-effective scaling decisions, prefill-decode instance linking during scaling, and rapid P/D role transitions. To accelerate scaling and reduce cold-start latency, a device-to-device (D2D) weight transfer mechanism is proposed that lowers weight loading overhead by up to **19.39×**. These optimizations allow the system to achieve up to **4.44×** higher SLO attainment, **65.82%** lower request latency, and cost parity with state-of-the-art baselines. The code will be released soon.
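The paper does not include pseudocode, but the core idea of budget-based, multi-SLO-aware prioritization can be sketched as an earliest-slack-first policy: each request's remaining SLO budget (time to deadline minus estimated remaining service time) determines its scheduling order. The names and the linear service-time model below are illustrative assumptions, not the authors' actual implementation.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class ScheduledRequest:
    # Ordered by slack: the smaller the remaining budget,
    # the more urgent the request.
    slack: float
    rid: str = field(compare=False)

def slack(deadline: float, now: float, est_service: float) -> float:
    """SLO budget left after subtracting estimated remaining service time.

    A negative value means the request is already projected to miss
    its SLO and should be served (or rejected/rescheduled) first.
    """
    return (deadline - now) - est_service

def schedule(requests, now):
    """Return request IDs in tightest-budget-first order.

    `requests` is a list of (rid, deadline, est_service) tuples.
    """
    heap = [ScheduledRequest(slack(d, now, s), rid) for rid, d, s in requests]
    heapq.heapify(heap)
    order = []
    while heap:
        order.append(heapq.heappop(heap).rid)
    return order
```

In a real system the service-time estimate would come from a length predictor or profiled prefill/decode throughput, and separate budgets would be tracked for the prefill and decode stages, since the paper assigns them divergent SLOs.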