KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving

📅 2026-05-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the performance bottleneck caused by cross-node KV cache transmission in disaggregated large language model serving, where existing static compression methods fail to adapt to dynamic service conditions. The authors propose a service-aware adaptive KV cache compression framework that dynamically selects optimal configurations under latency and quality-of-service constraints. This is achieved by constructing a modular compression strategy space and integrating offline Bayesian optimization with a lightweight online multi-armed bandit controller. The approach establishes the first service-aware adaptive mechanism for KV cache compression, substantially reducing offline search overhead and bridging the gap between offline tuning and online deployment. Experiments demonstrate up to a 9.13× speedup in job completion time under PD-disaggregated architectures and as much as a 32.8× reduction in time-to-first-token latency in KV-disaggregated systems.
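The online half of this design can be illustrated with a small bandit loop. The summary does not say which bandit algorithm is used, so the sketch below substitutes textbook UCB1 as a stand-in. The arm names, the simulated latencies, and the latency-to-reward mapping are all hypothetical and chosen only for illustration.

```python
import math


class UCB1Controller:
    """Textbook UCB1 over compression profiles (a stand-in, not KVServe's controller)."""

    def __init__(self, arms, exploration=1.0):
        self.arms = list(arms)
        self.exploration = exploration
        self.counts = {a: 0 for a in self.arms}
        self.mean_reward = {a: 0.0 for a in self.arms}

    def choose(self):
        # Try every arm once before trusting any estimate.
        for arm in self.arms:
            if self.counts[arm] == 0:
                return arm
        total = sum(self.counts.values())

        def score(arm):
            bonus = math.sqrt(2.0 * math.log(total) / self.counts[arm])
            return self.mean_reward[arm] + self.exploration * bonus

        return max(self.arms, key=score)

    def update(self, arm, latency_s, slo_s):
        # Assumed reward shaping: 1 at zero latency, 0 at or beyond the SLO.
        reward = max(0.0, 1.0 - latency_s / slo_s)
        self.counts[arm] += 1
        self.mean_reward[arm] += (reward - self.mean_reward[arm]) / self.counts[arm]


# Hypothetical ground truth that an offline profile may have misjudged.
TRUE_LATENCY_S = {"none": 0.9, "int8": 0.5, "int4": 0.3}

controller = UCB1Controller(TRUE_LATENCY_S)
for _ in range(200):
    arm = controller.choose()
    controller.update(arm, TRUE_LATENCY_S[arm], slo_s=1.0)

best = max(controller.counts, key=controller.counts.get)
```

In this toy run the controller still samples every profile but concentrates its pulls on the one with the lowest observed latency, which is the sense in which online feedback can correct an offline estimate.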
📝 Abstract
LLMs are widely adopted in production, pushing inference systems to their limits. Disaggregated LLM serving (e.g., PD separation and KV state disaggregation) improves scalability and cost efficiency, but it also turns KV into an explicit payload crossing network and storage boundaries, making KV a dominant end-to-end bottleneck. Existing KV compression methods are typically static runtime configurations, even though the production service context varies over time in workload mix, bandwidth, and SLO/quality budgets. As a result, a fixed choice can be suboptimal or even increase latency. We present KVServe, the first service-aware and adaptive KV communication compression framework for disaggregated LLM serving. KVServe (1) unifies KV compression into a modular strategy space with new components and cross-method recomposition; (2) introduces a Bayesian Profiling Engine that efficiently searches this space and distills a 3D Pareto candidate set, reducing offline search overhead by 50×; and (3) deploys a Service-Aware Online Controller that combines an analytical latency model with a lightweight bandit to select profiles under constraints and correct offline-to-online mismatch. Integrated into vLLM and evaluated across datasets, models, GPUs, and networks, KVServe achieves up to 9.13× job completion time (JCT) speedup in PD-separated serving and up to 32.8× time-to-first-token (TTFT) reduction in KV-disaggregated serving.
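As a rough illustration of the selection step, the Python sketch below filters candidate profiles to a Pareto front and then picks the lowest-latency feasible one with a simple analytical model. Everything named here is an assumption: the abstract does not state the three Pareto axes, so the sketch uses compression ratio, quality loss, and codec cost. The `Profile` fields, the candidate numbers, and the latency formula (codec time plus compressed bytes over bandwidth) are hypothetical and not taken from KVServe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    name: str
    ratio: float           # raw bytes / compressed bytes (higher is better)
    quality_loss: float    # accuracy drop vs. uncompressed KV (lower is better)
    codec_s_per_gb: float  # compress + decompress seconds per raw GB (lower is better)


def dominates(a: Profile, b: Profile) -> bool:
    """True if a is at least as good as b on all three axes and better on one."""
    no_worse = (a.ratio >= b.ratio
                and a.quality_loss <= b.quality_loss
                and a.codec_s_per_gb <= b.codec_s_per_gb)
    better = (a.ratio > b.ratio
              or a.quality_loss < b.quality_loss
              or a.codec_s_per_gb < b.codec_s_per_gb)
    return no_worse and better


def pareto_front(profiles):
    return [p for p in profiles if not any(dominates(q, p) for q in profiles)]


def transfer_latency(p: Profile, kv_gb: float, bandwidth_gb_s: float) -> float:
    # Assumed model: codec work on the raw KV plus wire time for the compressed KV.
    return kv_gb * p.codec_s_per_gb + (kv_gb / p.ratio) / bandwidth_gb_s


def select(profiles, kv_gb, bandwidth_gb_s, max_quality_loss):
    feasible = [p for p in pareto_front(profiles)
                if p.quality_loss <= max_quality_loss]
    return min(feasible, key=lambda p: transfer_latency(p, kv_gb, bandwidth_gb_s))


# Hypothetical candidates; "mixed" is dominated by "int8" and gets filtered out.
CANDIDATES = [
    Profile("none", 1.0, 0.00, 0.00),
    Profile("int8", 2.0, 0.01, 0.02),
    Profile("int4", 4.0, 0.03, 0.05),
    Profile("mixed", 1.5, 0.02, 0.03),
]

slow_link = select(CANDIDATES, kv_gb=2.0, bandwidth_gb_s=1.0, max_quality_loss=0.05)
fast_link = select(CANDIDATES, kv_gb=2.0, bandwidth_gb_s=100.0, max_quality_loss=0.05)
print(slow_link.name, fast_link.name)  # prints "int4 none"
```

With these made-up numbers the most aggressive profile wins on the slow link, while on the fast link codec time outweighs the bytes saved and sending uncompressed KV is quickest. That is the toy version of the abstract's point that a fixed choice can increase latency.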
Problem

Research questions and friction points this paper is trying to address.

KV cache compression
disaggregated LLM serving
communication bottleneck
service-aware adaptation
dynamic workload
Innovation

Methods, ideas, or system contributions that make the work stand out.

KV cache compression
disaggregated LLM serving
service-aware adaptation
Bayesian profiling
online controller
Zedong Liu
University of Chinese Academy of Sciences; Institute of Computing Technology, Chinese Academy of Sciences
Xinyang Ma
University of Chinese Academy of Sciences; Institute of Computing Technology, Chinese Academy of Sciences
Dejun Luo
University of Chinese Academy of Sciences
Hairui Zhao
Institute of Computing Technology, Chinese Academy of Sciences
Bing Lu
Simon Fraser University
Remote Sensing, Environmental Change, Vegetation Ecology, UAV
Wenjing Huang
RAND Corporation
Psychometrics, Structural Equation Modeling, Item Response Theory, Cyber Security
Yida Gu
Institute of Computing Technology, Chinese Academy of Sciences
Xingchen Liu
Institute of Computing Technology, Chinese Academy of Sciences
Zheng Wei
Institute of Computing Technology, Chinese Academy of Sciences
Jinyang Liu
Shanghai Jiao Tong University
Dingwen Tao
Chinese Academy of Sciences, IEEE/ACM Senior Member
High Performance Computing, Data Reduction, Deep Learning, Systems for ML, GPU
Guangming Tan
Institute of Computing Technology, Chinese Academy of Sciences