TokenScale: Timely and Accurate Autoscaling for Disaggregated LLM Serving with Token Velocity

📅 2025-12-03
🤖 AI Summary
Existing autoscaling mechanisms struggle to handle bursty traffic in LLM serving, leading to severe violations of TTFT/TPOT SLOs and substantial resource over-provisioning. This paper addresses disaggregated LLM serving—where prefill and decoding are executed on separate GPU clusters—by proposing a proactive scaling method based on Token Velocity (TV), a lightweight, low-latency workload rate metric. We further introduce the Convertible Decoder architecture, enabling decoding GPUs to dynamically assume prefill tasks, thereby achieving cross-stage elastic reuse of compute resources. Evaluated on real production workloads, our approach improves SLO compliance to 80–96%, reduces cost by 4–14%, and significantly outperforms state-of-the-art systems including DistServe, BlitzScale, and AIBrix. Our key contributions are: (i) the first use of TV for LLM autoscaling decisions, and (ii) hardware-level task convertibility that breaks the rigid prefill/decode resource isolation barrier.

📝 Abstract
The architectural shift to prefill/decode (PD) disaggregation in LLM serving improves resource utilization but struggles with the bursty nature of modern workloads. Existing autoscaling policies, often retrofitted from monolithic systems like those in AIBrix and DistServe, rely on lagging indicators such as GPU utilization or coarse-grained request counts. This results in slow reactions to load spikes, leading to significant Time-to-First-Token (TTFT) and Time-Per-Output-Token (TPOT) SLO violations and costly over-provisioning. We introduce TokenScale, an autoscaling framework that resolves this performance mismatch through two innovations. First, we propose Token Velocity, a novel metric that unifies the prefill, network, and decode stages by quantifying their rate of work. As a leading indicator of system backpressure, it enables proactive scaling. Second, Convertible Decoders allow decoder GPUs to dynamically execute prefill tasks during traffic spikes, creating a rapid-response buffer that absorbs bursts and eliminates the initialization latency of new prefillers. Our evaluation on a GPU cluster with production traces shows TokenScale improves SLO attainment from 50-88% to 80-96% and reduces costs by 4-14% over state-of-the-art systems, including DistServe, BlitzScale, and AIBrix. By uniting a predictive metric with a flexible system design, TokenScale significantly boosts the performance and efficiency of disaggregated LLM serving infrastructure.
Problem

Research questions and friction points this paper is trying to address.

Autoscaling struggles with bursty workloads in disaggregated LLM serving.
Existing policies cause slow reactions, leading to SLO violations and over-provisioning.
TokenScale addresses performance mismatch with proactive scaling and flexible resource use.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token Velocity metric for proactive scaling
Convertible Decoders absorb traffic bursts
Unified predictive metric with flexible design
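
The Token Velocity idea described above can be sketched as a comparison between the token rate arriving at a stage and the rate it can serve: sustained arrival above service capacity signals growing backpressure before lagging indicators like GPU utilization react. The following is a minimal, hypothetical illustration; all names, the `headroom` threshold, and the policy structure are assumptions for exposition, not the paper's actual implementation.

```python
# Hypothetical sketch of a Token-Velocity-driven scaling decision.
# Names and thresholds are illustrative, not TokenScale's real code.
from dataclasses import dataclass

@dataclass
class StageStats:
    arrival_tv: float  # tokens/s entering the stage (measured)
    service_tv: float  # tokens/s the stage currently processes

def scaling_decision(stage: StageStats, headroom: float = 1.2) -> str:
    """Compare arrival vs. service token velocity for one stage.

    Arrival outpacing service means backpressure is building, so
    capacity should be added proactively; arrival well below
    service (beyond the headroom factor) means capacity can shrink.
    """
    if stage.arrival_tv > stage.service_tv:
        return "scale_up"      # backpressure building: add capacity
    if stage.arrival_tv * headroom < stage.service_tv:
        return "scale_down"    # ample headroom: release capacity
    return "steady"

print(scaling_decision(StageStats(arrival_tv=12000, service_tv=9000)))  # scale_up
```

In a disaggregated deployment, one such check per stage (prefill, network, decode) would feed the scaler; the Convertible Decoder mechanism would then let a "scale_up" on the prefill stage be satisfied by temporarily borrowing decode GPUs instead of cold-starting new prefillers.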
Authors

Ruiqi Lai, NTU Singapore
Hongrui Liu, NTU Singapore
Chengzhi Lu, SIAT (Cloud Computing, Serverless, Deep Learning, Operating Systems)
Zonghao Liu, NTU Singapore
Siyu Cao, NTU Singapore
Siyang Shao, Georgia Institute of Technology
Yixin Zhang, Alibaba Group
Luo Mai, Associate Professor at University of Edinburgh (Computer Systems, Machine Learning, Data Management)
Dmitrii Ustiugov, NTU Singapore (Cloud Computing, Serverless, Systems for ML)