KVDrive: A Holistic Multi-Tier KV Cache Management System for Long-Context LLM Inference

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high memory demand of key-value (KV) caching in long-context large language model inference, which often leads to GPU memory overflow. Existing offloading approaches suffer from excessive data transfer and decoding latency under long-context and large-batch settings. To overcome these limitations, the authors propose a multi-tier KV cache management system spanning GPU memory, host DRAM, and SSD. The system employs an attention-aware caching policy to dynamically enhance cache reuse, restructures the decoding pipeline to overlap I/O with computation, and coordinates cross-tier data migration in a unified manner. Experimental results on mainstream large language models and long-context benchmarks demonstrate up to a 1.74× throughput improvement over prior methods, without compromising model accuracy.
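To make the memory pressure concrete, the sketch below estimates KV cache size with the standard accounting: one key and one value vector per layer, KV head, and token. The model configuration (32 layers, 8 KV heads, head dimension 128, 16-bit values) and the batch size are illustrative assumptions, not figures from the paper, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """Bytes needed to hold keys and values for every token in the batch."""
    # Factor of 2: each token stores one key and one value per layer and KV head.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


GIB = 1024 ** 3

# Assumed configuration, for illustration only (not taken from the paper).
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                      seq_len=128 * 1024, batch_size=8)
print(f"{size / GIB:.0f} GiB")  # prints "128 GiB"
```

Under these assumptions a single 128K-token sequence already needs 16 GiB, and a batch of eight needs 128 GiB, which is why the cache has to spill beyond GPU memory.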

📝 Abstract

Supporting long-context LLMs is challenging due to the substantial memory demands of the key-value (KV) cache. Existing offloading systems store the full cache in host memory and selectively fetch critical entries during decoding, but this strategy quickly hits a ceiling: sparsity cannot be pushed further without degrading accuracy. As a result, when context length and batch size grow, the volume of KV transfers rises sharply and becomes the dominant source of decoding latency. We present KVDrive, a holistic multi-tier KV cache management system spanning GPU memory, host DRAM, and SSD. Unlike prior work that pursues greater sparsity through algorithmic refinements, KVDrive tackles the problem from a systems perspective, jointly orchestrating cache placement, pipeline scheduling, and cross-tier coordination to sustain high-throughput inference under tight GPU budgets. KVDrive advances three fundamental capabilities: it adapts cache management to attention behavior to maximize reuse and minimize redundant data movement; it restructures the decoding pipeline to overlap I/O-bound stages with CPU/GPU compute-bound stages, eliminating stalls across heterogeneous resources; and it harmonizes data movement across memory tiers to unlock scalable long-context inference far beyond GPU and DRAM limits. We have implemented a fully functional prototype of KVDrive and evaluated it on long-context benchmarks with popular LLMs. The system achieves up to 1.74× higher throughput than state-of-the-art systems while preserving accuracy.
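The abstract does not spell out KVDrive's placement policy, so the following is only a minimal sketch of the general idea of a three-tier cache that uses attention behavior to decide what stays on the GPU. Everything here is an assumption for illustration: the `TieredKVCache` class, the block granularity, the decayed attention score, and the lowest-score demotion rule are hypothetical and are not the authors' method. Capacities are counted in blocks and assumed to be at least 1.

```python
from dataclasses import dataclass, field


@dataclass
class TieredKVCache:
    """Toy GPU -> DRAM -> SSD cache; capacities are counted in KV blocks."""
    gpu_capacity: int
    dram_capacity: int
    tiers: dict = field(default_factory=lambda: {"gpu": {}, "dram": {}, "ssd": {}})
    scores: dict = field(default_factory=dict)

    def put(self, block_id, block):
        # New blocks start on the GPU; colder blocks are demoted to make room.
        self.tiers["gpu"][block_id] = block
        self.scores.setdefault(block_id, 0.0)
        self._rebalance(protect=block_id)

    def record_attention(self, block_id, weight, decay=0.9):
        # Assumed heuristic: exponentially decayed sum of attention mass.
        self.scores[block_id] = decay * self.scores.get(block_id, 0.0) + weight

    def get(self, block_id):
        # Search fastest tier first; promote the block to the GPU on access.
        for name in ("gpu", "dram", "ssd"):
            if block_id in self.tiers[name]:
                block = self.tiers[name].pop(block_id)
                self.tiers["gpu"][block_id] = block
                self._rebalance(protect=block_id)
                return block, name
        raise KeyError(block_id)

    def _rebalance(self, protect=None):
        self._demote("gpu", "dram", self.gpu_capacity, protect)
        self._demote("dram", "ssd", self.dram_capacity, protect)

    def _demote(self, src, dst, capacity, protect):
        # Push the lowest-scoring blocks one tier down until src fits.
        while len(self.tiers[src]) > capacity:
            victim = min((b for b in self.tiers[src] if b != protect),
                         key=lambda b: self.scores[b])
            self.tiers[dst][victim] = self.tiers[src].pop(victim)


cache = TieredKVCache(gpu_capacity=2, dram_capacity=1)
cache.put(0, "kv0")
cache.put(1, "kv1")
cache.record_attention(0, 0.8)   # block 0 is attended to heavily
cache.record_attention(1, 0.1)   # block 1 is rarely attended to
cache.put(2, "kv2")              # demotes block 1 to DRAM
cache.record_attention(2, 0.5)
cache.put(3, "kv3")              # demotes block 2 to DRAM, block 1 to SSD
block, tier = cache.get(1)       # found on SSD, promoted back to the GPU
print(tier, sorted(cache.tiers["gpu"]))
```

A real system would move tensors asynchronously and overlap these transfers with computation, as the abstract describes; this sketch models only the placement bookkeeping.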

Problem

Research questions and friction points this paper is trying to address.

KV cache

long-context LLM inference

memory offloading

decoding latency

multi-tier memory

Innovation

Methods, ideas, or system contributions that make the work stand out.

KV cache management

multi-tier memory system

long-context LLM inference