SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips

📅 2026-01-28
🤖 AI Summary
This work addresses the tension between stringent latency SLOs and limited GPU memory in large language model inference under high request loads, which often leads to head-of-line blocking. Targeting the NVIDIA GH200 Superchip architecture with NVLink-C2C interconnects, the authors propose RotaSched, an SLO-aware proactive rotation scheduler, and DuplexKV, a full-duplex KV cache migration engine, which jointly optimize request scheduling and memory management. This approach is the first to integrate SLO-aware scheduling with full-duplex high-speed interconnects, fully exploiting the hardware capabilities of the Superchip. Experimental results demonstrate that, compared to existing systems, the proposed method improves TTFT SLO compliance by up to 74.7% while maintaining comparable time-between-tokens (TBT) latency and throughput.

📝 Abstract
Large Language Model (LLM) serving faces a fundamental tension between stringent latency Service Level Objectives (SLOs) and limited GPU memory capacity. When high request rates exhaust the KV cache budget, existing LLM inference systems often suffer severe head-of-line (HOL) blocking. While prior work explored PCIe-based offloading, these approaches cannot sustain responsiveness under high request rates, often failing to meet tight Time-To-First-Token (TTFT) and Time-Between-Tokens (TBT) SLOs. We present SuperInfer, a high-performance LLM inference system designed for emerging Superchips (e.g., NVIDIA GH200) with tightly coupled GPU-CPU architecture via NVLink-C2C. SuperInfer introduces RotaSched, the first proactive, SLO-aware rotary scheduler that rotates requests to maintain responsiveness on Superchips, and DuplexKV, an optimized rotation engine that enables full-duplex transfer over NVLink-C2C. Evaluations on GH200 using various models and datasets show that SuperInfer improves TTFT SLO attainment rates by up to 74.7% while maintaining comparable TBT and throughput compared to state-of-the-art systems, demonstrating that SLO-aware scheduling and memory co-design unlocks the full potential of Superchips for responsive LLM serving.
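The paper does not spell out RotaSched's policy beyond "proactive, SLO-aware rotation," so the following is only an illustrative sketch, not the authors' algorithm: when GPU KV cache blocks run out, a rotation scheduler might evict ("rotate out" to CPU memory over NVLink-C2C) the resident requests with the most remaining slack before their TTFT/TBT deadlines, so that the tightest-deadline requests stay resident. All names (`Request`, `rotate_for_admission`, `slack`, `kv_blocks`) are hypothetical.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Request:
    slack: float                      # seconds remaining until this request's SLO deadline
    rid: int = field(compare=False)   # request id
    kv_blocks: int = field(compare=False)  # KV cache blocks it occupies

def rotate_for_admission(resident, incoming, free_blocks):
    """Illustrative rotation policy (an assumption, not the paper's exact method):
    rotate out the resident requests with the MOST slack until the incoming
    request's KV cache fits. Returns (rotated_out_ids, admitted)."""
    # Max-heap on slack: pause the requests that can best tolerate a pause.
    heap = [(-r.slack, r.rid, r.kv_blocks) for r in resident]
    heapq.heapify(heap)
    rotated = []
    while free_blocks < incoming.kv_blocks and heap:
        _, rid, blocks = heapq.heappop(heap)
        rotated.append(rid)           # its KV cache would migrate GPU -> CPU
        free_blocks += blocks
    return rotated, free_blocks >= incoming.kv_blocks
```

In a real system the freed blocks would be migrated asynchronously (DuplexKV's full-duplex NVLink-C2C transfers would let evictions and later restorations overlap), but this sketch only captures the victim-selection idea.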
Problem

Research questions and friction points this paper is trying to address.

LLM inference
Service Level Objectives
KV cache
head-of-line blocking
latency
Innovation

Methods, ideas, or system contributions that make the work stand out.

SLO-aware scheduling
rotary scheduling
KV cache management
NVLink-C2C
LLM inference
Jiahuan Yu — University of Illinois Urbana-Champaign (Machine Learning Systems, Computer Vision, 3D Vision)
Mingtao Hu — Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign
Zichao Lin — Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign
Minjia Zhang — University of Illinois Urbana-Champaign (Parallelism, Machine Learning Systems, Model Compression, LLM Applications)