🤖 AI Summary
This work addresses the tension between stringent latency SLOs and limited GPU memory in large language model inference under high request loads, which often leads to head-of-line blocking. Targeting the NVIDIA GH200 Superchip architecture with its NVLink-C2C interconnect, the authors propose RotaSched, an SLO-aware proactive rotation scheduler, and DuplexKV, a full-duplex KV cache migration engine, which jointly optimize request scheduling and memory management. This approach is the first to integrate SLO-aware scheduling with a full-duplex high-speed interconnect, fully exploiting the hardware capabilities of the Superchip. Experimental results show that, compared to existing systems, the proposed method improves Time-To-First-Token (TTFT) SLO compliance by up to 74.7% while maintaining comparable Time-Between-Tokens (TBT) latency and throughput.
📝 Abstract
Large Language Model (LLM) serving faces a fundamental tension between stringent latency Service Level Objectives (SLOs) and limited GPU memory capacity. When high request rates exhaust the KV cache budget, existing LLM inference systems often suffer severe head-of-line (HOL) blocking. While prior work explored PCIe-based offloading, these approaches cannot sustain responsiveness under high request rates, often failing to meet tight Time-To-First-Token (TTFT) and Time-Between-Tokens (TBT) SLOs. We present SuperInfer, a high-performance LLM inference system designed for emerging Superchips (e.g., NVIDIA GH200) with a tightly coupled GPU-CPU architecture via NVLink-C2C. SuperInfer introduces RotaSched, the first proactive, SLO-aware rotary scheduler that rotates requests to maintain responsiveness on Superchips, and DuplexKV, an optimized rotation engine that enables full-duplex transfer over NVLink-C2C. Evaluations on GH200 across various models and datasets show that SuperInfer improves TTFT SLO attainment rates by up to 74.7% while maintaining comparable TBT and throughput relative to state-of-the-art systems, demonstrating that SLO-aware scheduling and memory co-design unlock the full potential of Superchips for responsive LLM serving.
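To make the core idea of SLO-aware rotation concrete, here is a minimal, hypothetical sketch (not the authors' implementation; all class and method names are invented for illustration). It shows how a rotary scheduler might admit a new request when the GPU KV-cache budget is exhausted: instead of blocking the arrival (HOL blocking), it proactively rotates out the running request with the most SLO slack, whose KV cache would then be migrated to host memory over a fast interconnect such as NVLink-C2C.

```python
# Hypothetical sketch of SLO-aware rotation scheduling.
# Request and RotaryScheduler are illustrative names, not SuperInfer APIs.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Request:
    rid: str
    kv_blocks: int    # GPU KV-cache blocks held by this request
    slack_ms: float   # time remaining before this request's SLO deadline


@dataclass
class RotaryScheduler:
    budget_blocks: int                                  # total GPU KV-cache budget
    running: List[Request] = field(default_factory=list)

    def used_blocks(self) -> int:
        return sum(r.kv_blocks for r in self.running)

    def admit(self, req: Request) -> Optional[Request]:
        """Admit `req`; if the KV budget would overflow, rotate out the
        running request with the most slack and return it to the caller,
        which would migrate its KV cache to CPU memory."""
        evicted = None
        if self.used_blocks() + req.kv_blocks > self.budget_blocks:
            # Victim choice: the request that can best tolerate a round
            # trip through host memory without violating its SLO.
            evicted = max(self.running, key=lambda r: r.slack_ms)
            self.running.remove(evicted)
        self.running.append(req)
        return evicted


sched = RotaryScheduler(budget_blocks=100)
sched.admit(Request("a", kv_blocks=60, slack_ms=500.0))
sched.admit(Request("b", kv_blocks=30, slack_ms=50.0))
victim = sched.admit(Request("c", kv_blocks=40, slack_ms=200.0))
print(victim.rid)  # "a": it has the most slack, so it is rotated out
```

A real system would also overlap the eviction and restore transfers in both directions (as DuplexKV does with full-duplex NVLink-C2C) and account for migration latency when computing slack; this sketch only captures the victim-selection intuition.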