🤖 AI Summary
This work addresses the performance bottleneck caused by cross-node KV cache transmission in disaggregated large language model serving, where existing static compression methods fail to adapt to dynamic service conditions. The authors propose a service-aware adaptive KV cache compression framework that dynamically selects compression configurations under latency and quality-of-service constraints. This is achieved by constructing a modular compression strategy space and integrating offline Bayesian optimization with a lightweight online multi-armed bandit controller. The authors describe the approach as the first service-aware adaptive mechanism for KV cache compression; it reduces offline search overhead by 50× and corrects the mismatch between offline tuning and online deployment. Experiments demonstrate up to a 9.13× speedup in job completion time under PD-disaggregated architectures and as much as a 32.8× reduction in time-to-first-token latency in KV-disaggregated systems.
📝 Abstract
LLMs are widely adopted in production, pushing inference systems to their limits. Disaggregated LLM serving (e.g., prefill-decode (PD) separation and KV state disaggregation) improves scalability and cost efficiency, but it also turns the KV cache into an explicit payload that crosses network and storage boundaries, making KV transfer a dominant end-to-end bottleneck. Existing KV compression methods are typically deployed as static runtime configurations, even though the production service context varies over time in workload mix, bandwidth, and SLO/quality budgets. As a result, a fixed choice can be suboptimal or even increase latency. We present \emph{KVServe}, the first service-aware and adaptive KV communication compression framework for disaggregated LLM serving. KVServe (1) unifies KV compression into a modular strategy space with new components and cross-method recomposition; (2) introduces a Bayesian Profiling Engine that efficiently searches this space and distills a 3D Pareto candidate set, reducing offline search overhead by $50\times$; and (3) deploys a Service-Aware Online Controller that combines an analytical latency model with a lightweight bandit to select profiles under constraints and correct offline-to-online mismatch. Integrated into vLLM and evaluated across datasets, models, GPUs, and networks, KVServe achieves up to $9.13\times$ speedup in job completion time (JCT) in PD-separated serving and up to $32.8\times$ reduction in time-to-first-token (TTFT) in KV-disaggregated serving.
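To make the controller idea concrete, the following is a minimal sketch of how an analytical latency model and a lightweight bandit might be combined to pick a compression profile under a quality budget. It is not KVServe's implementation: the `Profile` fields, the `predicted_latency` formula (codec time plus transfer time), the UCB-style exploration bonus, and all numeric values are assumptions chosen for illustration, since the abstract does not specify them.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """Hypothetical compression profile; every field is illustrative."""
    name: str
    ratio: float           # compressed size = raw size / ratio
    codec_s_per_gb: float  # compress + decompress seconds per raw GB
    quality_loss: float    # estimated quality drop (0 = lossless)


def predicted_latency(p: Profile, kv_gb: float, bw_gbps: float) -> float:
    """Assumed analytical model: codec time plus transfer time, in seconds."""
    transfer_s = (kv_gb / p.ratio) * 8.0 / bw_gbps  # GB -> Gbit, then / Gbps
    return p.codec_s_per_gb * kv_gb + transfer_s


class Controller:
    """Bandit that learns a per-profile correction to the offline model."""

    def __init__(self, profiles):
        self.profiles = list(profiles)
        self.counts = {p.name: 0 for p in self.profiles}
        # Running mean of observed / predicted latency (1.0 = model is exact).
        self.mean_ratio = {p.name: 1.0 for p in self.profiles}
        self.total = 0

    def select(self, kv_gb, bw_gbps, max_quality_loss, explore=0.1):
        feasible = [p for p in self.profiles
                    if p.quality_loss <= max_quality_loss]
        if not feasible:
            raise ValueError("no profile satisfies the quality budget")

        def score(p):
            est = predicted_latency(p, kv_gb, bw_gbps) * self.mean_ratio[p.name]
            # Optimism bonus: rarely tried profiles get a lower (better) score.
            bonus = explore * math.sqrt(
                math.log(self.total + 1) / (self.counts[p.name] + 1))
            return est * max(0.0, 1.0 - bonus)

        return min(feasible, key=score)

    def update(self, p, kv_gb, bw_gbps, observed_s):
        """Fold one observed latency into the profile's correction factor."""
        pred = predicted_latency(p, kv_gb, bw_gbps)
        self.counts[p.name] += 1
        self.total += 1
        n = self.counts[p.name]
        self.mean_ratio[p.name] += (observed_s / pred - self.mean_ratio[p.name]) / n


PROFILES = [
    Profile("none", ratio=1.0, codec_s_per_gb=0.0, quality_loss=0.0),
    Profile("light", ratio=2.0, codec_s_per_gb=0.05, quality_loss=0.001),
    Profile("heavy", ratio=8.0, codec_s_per_gb=0.4, quality_loss=0.02),
]

ctrl = Controller(PROFILES)
fast_link = ctrl.select(kv_gb=1.0, bw_gbps=100.0, max_quality_loss=0.05)
slow_link = ctrl.select(kv_gb=1.0, bw_gbps=1.0, max_quality_loss=0.05)
strict_slo = ctrl.select(kv_gb=1.0, bw_gbps=1.0, max_quality_loss=0.01)
```

With these made-up numbers, the uncompressed profile is selected on the 100 Gbps link because codec time outweighs the transfer savings, while the heaviest compression is selected on the 1 Gbps link. This is the sense in which a fixed choice can increase latency. Calling `update` with observed latencies lets the controller move away from a profile whose real cost exceeds the offline prediction.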