🤖 AI Summary
Existing auto-scaling mechanisms struggle to handle bursty traffic in LLM serving, leading to severe violations of TTFT/TPOT SLOs and substantial resource over-provisioning. This paper addresses disaggregated LLM serving, where prefill and decoding are executed on separate GPU clusters, by proposing a proactive scaling method based on Token Velocity (TV), a lightweight, low-latency workload rate metric. It further introduces the Convertible Decoder architecture, enabling decoding GPUs to dynamically take on prefill tasks and thereby achieve cross-stage elastic reuse of compute resources. Evaluated on real production workloads, the approach improves SLO compliance to 80–96%, reduces cost by 4–14%, and significantly outperforms state-of-the-art systems including DistServe, BlitzScale, and AIBrix. The key contributions are: (i) the first use of TV for LLM auto-scaling decisions, and (ii) system-level task convertibility that breaks the rigid prefill/decode resource isolation barrier.
📝 Abstract
The architectural shift to prefill/decode (PD) disaggregation in LLM serving improves resource utilization but struggles with the bursty nature of modern workloads. Existing autoscaling policies, often retrofitted from monolithic systems like those in AIBrix and DistServe, rely on lagging indicators such as GPU utilization or coarse-grained request counts. This results in slow reactions to load spikes, leading to significant Time-to-First-Token (TTFT) and Time-Per-Output-Token (TPOT) SLO violations and costly over-provisioning. We introduce TokenScale, an autoscaling framework that resolves this performance mismatch through two innovations. First, we propose Token Velocity, a novel metric that unifies the prefill, network, and decode stages by quantifying their rate of work. As a leading indicator of system backpressure, it enables proactive scaling. Second, Convertible Decoders allow decoder GPUs to dynamically execute prefill tasks during traffic spikes, creating a rapid-response buffer that absorbs bursts and eliminates the initialization latency of new prefillers. Our evaluation on a GPU cluster with production traces shows TokenScale improves SLO attainment from 50–88% to 80–96% and reduces costs by 4–14% over state-of-the-art systems, including DistServe, BlitzScale, and AIBrix. By uniting a predictive metric with a flexible system design, TokenScale significantly boosts the performance and efficiency of disaggregated LLM serving infrastructure.
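To make the Token Velocity idea concrete, here is a minimal sketch of how a rate-of-work signal could drive a proactive scaling decision. This is an illustration only: the paper does not publish this code, and the class names, window size, and headroom threshold below are all assumptions, not TokenScale's actual implementation.

```python
from collections import deque


class TokenVelocityMonitor:
    """Tokens-per-second over a sliding window for one pipeline stage.

    Illustrative sketch: the paper describes Token Velocity as a
    lightweight rate-of-work metric; the window size and API here
    are assumed for demonstration.
    """

    def __init__(self, window_s: float = 5.0):
        self.window_s = window_s
        self._events = deque()  # (timestamp, token_count) pairs

    def record(self, timestamp: float, tokens: int) -> None:
        self._events.append((timestamp, tokens))

    def velocity(self, now: float) -> float:
        # Evict events that fell out of the sliding window, then
        # return tokens processed per second within it.
        while self._events and now - self._events[0][0] > self.window_s:
            self._events.popleft()
        return sum(t for _, t in self._events) / self.window_s


def should_scale_prefill(arrival_tv: float, service_tv: float,
                         headroom: float = 1.2) -> bool:
    """Proactive trigger: act (e.g. add a prefiller, or convert a
    decoder GPU to prefill duty) when incoming token velocity outpaces
    serving velocity by more than a headroom factor. The 1.2 value is
    an illustrative assumption."""
    return arrival_tv > headroom * service_tv


# Usage: a burst of prompt tokens overtakes the prefill stage's rate,
# so the trigger fires before queues (and TTFT) have visibly degraded.
arrivals = TokenVelocityMonitor(window_s=5.0)
service = TokenVelocityMonitor(window_s=5.0)
for t in range(5):
    arrivals.record(float(t), tokens=3000)  # 3000 new prompt tokens/s
    service.record(float(t), tokens=2000)   # prefillers sustain 2000/s
print(should_scale_prefill(arrivals.velocity(now=5.0),
                           service.velocity(now=5.0)))  # True
```

The point of the sketch is the leading-indicator property: the comparison is between arrival and service *rates*, so it fires as soon as a burst begins rather than after GPU utilization or queue length, which are lagging signals, have already climbed.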