🤖 AI Summary
Transformer-based large language models face a memory bottleneck during long-sequence inference because KV cache consumption in GPU VRAM grows linearly with sequence length; existing compression techniques (e.g., pruning, quantization) often incur irreversible information loss. Method: We propose a CPU-GPU heterogeneous KV cache management framework that offloads the full-precision KV cache to CPU memory while retaining only a low-bit replica in VRAM. A novel importance-prediction-guided speculative prefetching mechanism dynamically identifies and preloads critical KV pairs before each decoding step, allowing prefetching to be pipelined with computation. Contribution/Results: The method requires no retraining and avoids irreversible information loss. Experiments on LongBench and Needle-in-a-Haystack show that SpeCache effectively reduces peak VRAM usage with no degradation in generation quality and without increasing end-to-end latency, even at a 10× KV cache compression ratio.
📝 Abstract
Transformer-based large language models (LLMs) have achieved remarkable results on long-text tasks, but limited GPU memory (VRAM) struggles to accommodate the key-value (KV) cache, whose size grows linearly with sequence length; this has become a bottleneck for applying LLMs to long sequences. Existing KV cache compression methods reduce the cache's size through eviction, merging, or quantization. However, compression causes irreversible information forgetting, which can degrade the accuracy of subsequent decoding. In this paper, we propose SpeCache, which takes full advantage of large and easily expandable CPU memory to offload the complete KV cache, and dynamically fetches KV pairs back at each decoding step based on their importance, measured using a low-bit copy of the KV cache kept in VRAM. To avoid the inference latency caused by CPU-GPU communication, SpeCache speculatively predicts which KV pairs the next token will attend to and prefetches them before the next decoding step, enabling prefetching and computation to run in parallel. Experiments on the LongBench and Needle-in-a-Haystack benchmarks verify that SpeCache effectively reduces VRAM usage while avoiding information forgetting on long sequences, without retraining, even at a 10× KV cache compression ratio.
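The offload-and-prefetch loop described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration of the idea only: the 2-bit quantizer, the top-k importance proxy, and all function names are assumptions made for illustration, not the paper's actual implementation.

```python
# Illustrative sketch of SpeCache-style KV management (not the paper's code):
# the full-precision KV cache lives in CPU memory, a low-bit key replica stays
# in VRAM and is used to predict which KV pairs the next step will attend to.
import numpy as np

def quantize_2bit(x):
    """Symmetric per-tensor 2-bit quantization: the compressed VRAM replica."""
    scale = np.abs(x).max() / 3.0 + 1e-8
    q = np.clip(np.round(x / scale), -3, 3)
    return q, scale

def predict_topk(query, k_lowbit, scale, top_k):
    """Score all cached keys with the low-bit copy; pick indices to prefetch."""
    scores = (k_lowbit * scale) @ query          # approximate attention logits
    return np.argsort(scores)[-top_k:]

def decode_step(query, cpu_keys, cpu_values, prefetched_idx):
    """Attend over the full-precision KV pairs prefetched from CPU memory."""
    K = cpu_keys[prefetched_idx]                 # "VRAM-resident" working set
    V = cpu_values[prefetched_idx]
    logits = K @ query
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ V

rng = np.random.default_rng(0)
seq_len, dim, top_k = 64, 16, 8
cpu_keys = rng.standard_normal((seq_len, dim))   # offloaded full-precision cache
cpu_values = rng.standard_normal((seq_len, dim))
k_lowbit, scale = quantize_2bit(cpu_keys)        # low-bit copy kept in VRAM

query = rng.standard_normal(dim)
idx = predict_topk(query, k_lowbit, scale, top_k)  # speculative prefetch
out = decode_step(query, cpu_keys, cpu_values, idx)
print(out.shape)
```

In a real system the prefetch for step t+1 would be issued asynchronously (e.g., on a separate CUDA stream with pinned host memory) while the attention computation for step t runs, which is what hides the CPU-GPU transfer latency.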