KVDrive: A Holistic Multi-Tier KV Cache Management System for Long-Context LLM Inference

📅 2026-05-18
📈 Citations: 0 (Influential: 0)

🤖 AI Summary
This work addresses the high memory demand of key-value (KV) caching in long-context large language model inference, which often exceeds available GPU memory. Existing offloading approaches incur excessive data transfer, and therefore high decoding latency, under long-context and large-batch settings. To overcome these limitations, the authors propose a multi-tier KV cache management system spanning GPU memory, host DRAM, and SSD. The system employs an attention-aware caching policy to increase cache reuse, restructures the decoding pipeline to overlap I/O with computation, and coordinates cross-tier data migration in a unified manner. Experimental results on mainstream large language models and long-context benchmarks demonstrate up to a 1.74× throughput improvement over prior methods, without compromising model accuracy.
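To make the idea of attention-aware placement across three tiers concrete, here is a minimal sketch. It is not KVDrive's policy, which the summary does not specify. The class name `TieredKVCache`, the exponential-moving-average scoring, and the per-tier block capacities are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field

TIERS = ("gpu", "dram", "ssd")  # ordered fastest to slowest


@dataclass
class TieredKVCache:
    """Toy three-tier placement of KV blocks (hypothetical, for illustration)."""
    capacity: dict                                  # blocks allowed per tier, e.g. {"gpu": 1, "dram": 1}
    decay: float = 0.5                              # weight kept from the previous score
    tier_of: dict = field(default_factory=dict)     # block id -> tier name
    score: dict = field(default_factory=dict)       # block id -> smoothed attention mass

    def insert(self, block_id):
        # New blocks start in the slowest tier with a zero score.
        self.tier_of[block_id] = "ssd"
        self.score[block_id] = 0.0

    def observe(self, attention_mass):
        # Update an exponential moving average of each block's attention mass.
        for block_id in self.tier_of:
            seen = attention_mass.get(block_id, 0.0)
            self.score[block_id] = (
                self.decay * self.score[block_id] + (1 - self.decay) * seen
            )

    def rebalance(self):
        # Rank blocks by score, then fill tiers from fastest to slowest.
        ranked = sorted(self.tier_of, key=lambda b: self.score[b], reverse=True)
        start = 0
        for tier in TIERS:
            # A tier without a stated capacity (here, SSD) takes everything left.
            limit = self.capacity.get(tier, len(ranked))
            for block_id in ranked[start:start + limit]:
                self.tier_of[block_id] = tier
            start += limit


cache = TieredKVCache(capacity={"gpu": 1, "dram": 1})
for block in ("a", "b", "c"):
    cache.insert(block)
cache.observe({"a": 0.1, "b": 0.7, "c": 0.2})
cache.rebalance()
print(cache.tier_of)
```

In this toy run, the block that receives the most attention mass is promoted to the GPU tier, the next goes to DRAM, and the rest stay on SSD. A real system would also have to account for transfer cost and block size, which this sketch ignores.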
📝 Abstract
Supporting long-context LLMs is challenging due to the substantial memory demands of the key-value (KV) cache. Existing offloading systems store the full cache in host memory and selectively fetch critical entries during decoding, but this strategy quickly hits a ceiling: sparsity cannot be pushed further without degrading accuracy. As a result, when context length and batch size grow, the volume of KV transfers rises sharply and becomes the dominant source of decoding latency. We present KVDrive, a holistic multi-tier KV cache management system spanning GPU memory, host DRAM, and SSD. Unlike prior work that pursues greater sparsity through algorithmic refinements, KVDrive tackles the problem from a systems perspective, jointly orchestrating cache placement, pipeline scheduling, and cross-tier coordination to sustain high-throughput inference under tight GPU budgets. KVDrive advances three fundamental capabilities: it adapts cache management to attention behavior to maximize reuse and minimize redundant data movement; it restructures the decoding pipeline to overlap I/O-bound stages with CPU/GPU compute-bound stages, eliminating stalls across heterogeneous resources; and it harmonizes data movement across memory tiers to unlock scalable long-context inference far beyond GPU and DRAM limits. We have implemented a fully functional prototype of KVDrive and evaluated it on long-context benchmarks with popular LLMs. The system achieves up to 1.74× higher throughput than state-of-the-art systems while preserving accuracy.
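The pipeline restructuring described above can be pictured as prefetching: while one layer's attention is computed, the next layer's KV entries are fetched in the background. The sketch below shows that general pattern with a single background I/O worker. It is a generic illustration, not KVDrive's scheduler; the function `decode_with_prefetch` and its `fetch_kv` and `attend` callbacks are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def decode_with_prefetch(layers, fetch_kv, attend):
    """Compute attend(layer, kv) per layer while the next layer's KV loads.

    fetch_kv(layer) stands in for an I/O-bound load from DRAM or SSD;
    attend(layer, kv) stands in for the compute-bound attention step.
    """
    outputs = []
    if not layers:
        return outputs
    with ThreadPoolExecutor(max_workers=1) as io_pool:
        # Start fetching the first layer's KV before any compute begins.
        pending = io_pool.submit(fetch_kv, layers[0])
        for i, layer in enumerate(layers):
            # Blocks only if the background fetch has not finished yet.
            kv = pending.result()
            if i + 1 < len(layers):
                # Issue the next fetch so it overlaps with this layer's compute.
                pending = io_pool.submit(fetch_kv, layers[i + 1])
            outputs.append(attend(layer, kv))
    return outputs


result = decode_with_prefetch(
    [0, 1, 2],
    fetch_kv=lambda layer: f"kv{layer}",
    attend=lambda layer, kv: (layer, kv),
)
print(result)
```

With this structure, per-layer latency approaches the larger of the fetch time and the compute time rather than their sum, provided the two stages genuinely run on different resources.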
Problem

Research questions and friction points this paper is trying to address.

KV cache
long-context LLM inference
memory offloading
decoding latency
multi-tier memory
Innovation

Methods, ideas, or system contributions that make the work stand out.

KV cache management
multi-tier memory system
long-context LLM inference
pipeline scheduling
heterogeneous resource coordination