KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving

📅 2026-05-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the performance bottleneck caused by cross-node KV cache transmission in disaggregated large language model serving, where existing static compression methods fail to adapt to dynamic service conditions. The authors propose a service-aware adaptive KV cache compression framework that dynamically selects optimal configurations under latency and quality-of-service constraints. This is achieved by constructing a modular compression strategy space and integrating offline Bayesian optimization with a lightweight online multi-armed bandit controller. The approach establishes the first service-aware adaptive mechanism for KV cache compression, substantially reducing offline search overhead and bridging the gap between offline tuning and online deployment. Experiments demonstrate up to a 9.13× speedup in job completion time under PD-disaggregated architectures and as much as a 32.8× reduction in time-to-first-token latency in KV-disaggregated systems.

📝 Abstract

LLMs are widely adopted in production, pushing inference systems to their limits. Disaggregated LLM serving (e.g., PD separation and KV state disaggregation) improves scalability and cost efficiency, but it also turns the KV cache into an explicit payload that crosses network and storage boundaries, making KV transfer a dominant end-to-end bottleneck. Existing KV compression methods are typically static runtime configurations, even though the production service context varies over time in workload mix, bandwidth, and SLO/quality budgets. As a result, a fixed choice can be suboptimal or even increase latency. We present KVServe, the first service-aware and adaptive KV communication compression framework for disaggregated LLM serving. KVServe (1) unifies KV compression into a modular strategy space with new components and cross-method recomposition; (2) introduces a Bayesian Profiling Engine that efficiently searches this space and distills a 3D Pareto candidate set, reducing offline search overhead by $50\times$; and (3) deploys a Service-Aware Online Controller that combines an analytical latency model with a lightweight bandit to select profiles under constraints and correct offline-to-online mismatch. Integrated into vLLM and evaluated across datasets, models, GPUs, and networks, KVServe achieves up to $9.13\times$ JCT speedup in PD-separated serving and up to $32.8\times$ TTFT reduction in KV-disaggregated serving.
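To make the controller idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a simple latency model (transfer time plus codec time) and a UCB-style bandit that keeps a per-profile running correction for the gap between predicted and observed latency. The names `Profile`, `estimated_latency`, `Controller`, and all numeric values are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """Hypothetical compression profile, e.g. one point from an offline Pareto set."""
    name: str
    ratio: float           # compressed size / original size
    codec_s_per_gb: float  # compress + decompress seconds per GB of original KV
    quality_loss: float    # offline-measured quality drop (arbitrary units)


def estimated_latency(p: Profile, kv_gb: float, bandwidth_gbps: float) -> float:
    """Assumed analytical model: network transfer time plus codec overhead."""
    transfer_s = kv_gb * p.ratio * 8 / bandwidth_gbps  # GB -> Gb, then divide by Gb/s
    codec_s = kv_gb * p.codec_s_per_gb
    return transfer_s + codec_s


class Controller:
    """UCB-style selector over profiles that satisfy the quality budget."""

    def __init__(self, profiles, explore=0.01):
        self.profiles = list(profiles)
        self.explore = explore
        self.total = 0
        self.counts = {p.name: 0 for p in self.profiles}
        # Running mean of (observed - predicted) latency per profile.
        self.residual = {p.name: 0.0 for p in self.profiles}

    def select(self, kv_gb, bandwidth_gbps, max_quality_loss):
        feasible = [p for p in self.profiles if p.quality_loss <= max_quality_loss]
        if not feasible:
            raise ValueError("no profile satisfies the quality budget")

        def score(p):
            predicted = estimated_latency(p, kv_gb, bandwidth_gbps)
            corrected = predicted + self.residual[p.name]
            # Optimism bonus: rarely tried profiles look slightly faster.
            bonus = self.explore * math.sqrt(
                math.log(self.total + 1) / (self.counts[p.name] + 1)
            )
            return corrected - bonus

        return min(feasible, key=score)

    def update(self, p, predicted_s, observed_s):
        """Fold one observed latency into the profile's running correction."""
        self.total += 1
        self.counts[p.name] += 1
        n = self.counts[p.name]
        self.residual[p.name] += ((observed_s - predicted_s) - self.residual[p.name]) / n


# Illustrative profiles; the numbers are made up, not measurements.
PROFILES = [
    Profile("none", ratio=1.0, codec_s_per_gb=0.0, quality_loss=0.0),
    Profile("int4", ratio=0.25, codec_s_per_gb=0.05, quality_loss=0.02),
]
```

Under these made-up numbers, compressing 1 GB of KV cache is faster on a 10 Gb/s link but slower on a 400 Gb/s link, where codec overhead outweighs the transfer savings. This is the sense in which a fixed choice can increase latency. The `update` step is one simple way to absorb offline-to-online mismatch; the actual bandit formulation in the paper may differ.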

Problem

Research questions and friction points this paper is trying to address.

KV cache compression

disaggregated LLM serving

communication bottleneck

service-aware adaptation

dynamic workload

Innovation

Methods, ideas, or system contributions that make the work stand out.

KV cache compression

disaggregated LLM serving

service-aware adaptation