HashEvict: A Pre-Attention KV Cache Eviction Strategy using Locality-Sensitive Hashing

📅 2024-12-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
The excessive memory overhead of KV caches severely hampers inference efficiency in large Transformer models. Method: This paper proposes a lightweight, pre-attention KV cache eviction mechanism leveraging locality-sensitive hashing (LSH) with binarized Gaussian random projections to rapidly estimate cosine similarity between the incoming query and cached keys before attention is computed; low-similarity entries, predicted to yield minimal attention scores, are prioritized for eviction. A GPU-resident binary index enables efficient Hamming-distance-based scoring. Contribution/Results: Unlike conventional eviction strategies that first compute attention to decide token retention, this approach makes eviction decisions *before* attention computation, trading a small amount of accuracy for reduced computational cost. Experiments demonstrate 30–70% KV cache compression while maintaining strong performance across reasoning, multiple-choice, long-context retrieval, and summarization tasks, significantly reducing GPU memory footprint and computational cost.

📝 Abstract
Transformer-based large language models (LLMs) use the key-value (KV) cache to significantly accelerate inference by storing the key and value embeddings of past tokens. However, this cache consumes significant GPU memory. In this work, we introduce HashEvict, an algorithm that uses locality-sensitive hashing (LSH) to compress the KV cache. HashEvict quickly locates tokens in the cache that are cosine dissimilar to the current query token. This is achieved by computing the Hamming distance between binarized Gaussian projections of the current token query and cached token keys, with a projection length much smaller than the embedding dimension. We maintain a lightweight binary structure in GPU memory to facilitate these calculations. Unlike existing compression strategies that compute attention to determine token retention, HashEvict makes these decisions pre-attention, thereby reducing computational costs. Additionally, HashEvict is dynamic: at every decoding step, the key and value of the current token replace the embeddings of a token expected to produce the lowest attention score. We demonstrate that HashEvict can compress the KV cache by 30%–70% while maintaining high performance across reasoning, multiple-choice, long-context retrieval, and summarization tasks.
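The mechanism in the abstract can be sketched in a few lines of numpy: binarize a shared Gaussian projection of the query and of each cached key, then use Hamming distance between the binary codes as a cheap proxy for cosine dissimilarity, and evict the cached token with the largest distance. This is a minimal illustrative sketch, not the authors' implementation; all shapes, names, and the toy random data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 64      # key/query embedding dimension (assumed for illustration)
d_hash = 16      # projection length, much smaller than d_model
cache_size = 8    # number of cached tokens (toy value)

# Shared Gaussian random projection; sign of each projected coordinate
# gives one bit of the LSH code.
P = rng.standard_normal((d_model, d_hash))

def binarize(x):
    """Project to d_hash dimensions and binarize: 1 where positive, else 0."""
    return (x @ P > 0).astype(np.uint8)

# The paper keeps a lightweight binary structure in GPU memory;
# here it is just a plain numpy array of key codes.
cached_keys = rng.standard_normal((cache_size, d_model))
key_codes = binarize(cached_keys)            # shape (cache_size, d_hash)

query = rng.standard_normal(d_model)
q_code = binarize(query[None, :])            # shape (1, d_hash)

# For sign-random-projection LSH, the probability that two vectors'
# bits differ grows with the angle between them, so a larger Hamming
# distance indicates lower cosine similarity.
hamming = np.count_nonzero(key_codes != q_code, axis=1)

# The cached token least similar to the query (largest Hamming distance)
# is expected to receive the lowest attention score and is the
# eviction candidate; its KV slot would be overwritten by the new token.
evict_idx = int(np.argmax(hamming))
```

Note that the scoring touches only `d_hash`-bit codes, never the full embeddings, which is what makes the pre-attention decision cheap relative to computing actual attention scores.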
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
KV Cache Optimization
Memory Management
Innovation

Methods, ideas, or system contributions that make the work stand out.

HashEvict
Locality Sensitive Hashing
KV Cache Compression