PQCache: Product Quantization-based KVCache for Long Context LLM Inference

📅 2024-07-01
🏛️ arXiv.org
📈 Citations: 18
Influential: 2
🤖 AI Summary
To address the GPU memory bottleneck induced by the KV cache in long-context LLM inference, this work pioneers the application of Product Quantization (PQ) to KV cache compression, framing it as an approximate nearest neighbor search problem in embedding space. The method combines an overlapping block partitioning scheme with a hierarchical caching mechanism to minimize additional computation and communication overhead across both the prefill and decode stages. It preserves model quality while substantially reducing serving latency: it achieves a 4.60% score improvement on InfiniteBench, outperforms state-of-the-art methods in both prefill and decode latency, and enables efficient inference over context lengths exceeding 10,000 tokens. The core contributions are (i) a PQ-driven KV compression paradigm that sharply reduces memory footprint without degrading accuracy, and (ii) a low-overhead hierarchical caching design that integrates compression seamlessly into the inference pipeline.

📝 Abstract
As the field of Large Language Models (LLMs) continues to evolve, the context length in inference is steadily growing. Key-Value Cache (KVCache), the intermediate representations of tokens within LLM inference, has now become the primary memory bottleneck due to limited GPU memory. Current methods selectively determine suitable keys and values for self-attention computation in LLMs to address the issue. However, they either fall short in maintaining model quality or result in high serving latency. Drawing inspiration from advanced embedding retrieval techniques prevalent in the data management community, we consider the storage and retrieval of KVCache as a typical embedding retrieval problem. We propose PQCache, which employs Product Quantization (PQ) to manage KVCache, maintaining model quality while ensuring low serving latency. During the prefilling phase, we apply PQ to tokens' keys for each LLM layer and head. During the autoregressive decoding phase, we use PQ codes and centroids to approximately identify important preceding tokens, then fetch the corresponding key-value pairs for self-attention computation. Through meticulous design of overlapping and caching, we minimize any additional computation and communication overhead during both phases. Extensive experiments demonstrate that PQCache achieves both effectiveness and efficiency, with 4.60% score improvement over existing methods on InfiniteBench and low system latency in both prefilling and decoding.
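As a rough illustration of the retrieval scheme the abstract describes (not the paper's actual implementation), the sketch below PQ-encodes cached keys per subspace at "prefill" time, then at "decode" time builds per-subspace lookup tables of query-centroid inner products to approximate attention scores and select the top-k tokens whose exact key-value pairs would be fetched. The toy k-means trainer, dimensions, and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

d, m, k = 64, 8, 16            # head dim, sub-vectors per key, centroids per subspace
sub = d // m                   # dimension of each sub-vector
n_tokens, top_k = 512, 32

keys = rng.standard_normal((n_tokens, d)).astype(np.float32)  # stand-in KV cache keys

# --- "prefill": train per-subspace centroids and encode keys as compact PQ codes ---
centroids = np.empty((m, k, sub), dtype=np.float32)
codes = np.empty((n_tokens, m), dtype=np.int64)
for j in range(m):
    block = keys[:, j * sub:(j + 1) * sub]
    c = block[rng.choice(n_tokens, k, replace=False)].copy()
    for _ in range(10):        # a few Lloyd iterations (toy k-means)
        assign = np.argmin(((block[:, None, :] - c[None]) ** 2).sum(-1), axis=1)
        for i in range(k):
            pts = block[assign == i]
            if len(pts):
                c[i] = pts.mean(0)
    centroids[j] = c
    codes[:, j] = np.argmin(((block[:, None, :] - c[None]) ** 2).sum(-1), axis=1)

# --- "decode": approximate q·k from codes + centroids, no full keys needed ---
q = rng.standard_normal(d).astype(np.float32)
tables = np.einsum('mkd,md->mk', centroids, q.reshape(m, sub))  # q_j · c_{j,i} per subspace
approx_scores = tables[np.arange(m), codes].sum(axis=1)         # sum lookups over subspaces
top_idx = np.argsort(approx_scores)[-top_k:]                    # tokens to fetch exactly
```

The approximate scores only rank tokens; the exact key-value pairs for `top_idx` would then be fetched for the actual self-attention computation, which is where the paper's overlapping and caching designs hide the transfer cost.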
Problem

Research questions and friction points this paper is trying to address.

Reducing GPU memory bottleneck in long-context LLM inference
Maintaining model quality while minimizing serving latency
Efficient KVCache storage and retrieval using Product Quantization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Product Quantization for KVCache management
Approximately identify important preceding tokens using PQ codes and centroids
Minimize overhead via overlapping and caching
🔎 Similar Papers
2024-01-31 · Neural Information Processing Systems · Citations: 193