🤖 AI Summary
The excessive memory overhead of KV caches severely hampers inference efficiency in large Transformer models.
Method: This paper proposes a lightweight, pre-attention KV cache eviction mechanism that uses locality-sensitive hashing (LSH) with binary Gaussian random projections to cheaply estimate the cosine similarity between the incoming query and cached keys; low-similarity entries, which are predicted to receive minimal attention scores, are prioritized for eviction. A compact GPU-resident binary index enables efficient Hamming-distance-based retrieval.
Contribution/Results: Unlike conventional eviction strategies that compute attention scores to decide which tokens to retain, this approach makes eviction decisions *before* attention computation, yielding a favorable speed-accuracy trade-off. Experiments demonstrate 30-70% KV cache compression while maintaining high performance across reasoning, multiple-choice, long-context retrieval, and summarization tasks, significantly reducing GPU memory footprint and computational cost.
📝 Abstract
Transformer-based large language models (LLMs) use the key-value (KV) cache to significantly accelerate inference by storing the key and value embeddings of past tokens. However, this cache consumes substantial GPU memory. In this work, we introduce HashEvict, an algorithm that uses locality-sensitive hashing (LSH) to compress the KV cache. HashEvict quickly locates tokens in the cache that are cosine dissimilar to the current query token. This is achieved by computing the Hamming distance between binarized Gaussian projections of the current token query and cached token keys, with a projection length much smaller than the embedding dimension. We maintain a lightweight binary structure in GPU memory to facilitate these calculations. Unlike existing compression strategies that compute attention to determine token retention, HashEvict makes these decisions pre-attention, thereby reducing computational costs. Additionally, HashEvict is dynamic: at every decoding step, the key and value of the current token replace the embeddings of a token expected to produce the lowest attention score. We demonstrate that HashEvict can compress the KV cache by 30%-70% while maintaining high performance across reasoning, multiple-choice, long-context retrieval, and summarization tasks.
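The core mechanism can be illustrated with a short sketch. This is not the paper's implementation; the dimensions, variable names, and eviction rule below are illustrative assumptions based only on the description above: sign-binarized Gaussian random projections (a standard LSH family for angular similarity), Hamming distance between the query's sketch and each cached key's sketch, and eviction of the token with the largest distance (lowest expected attention score).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: embedding dim d, projection length r << d, n cached tokens.
d, r, n = 64, 16, 8
P = rng.standard_normal((d, r))  # shared Gaussian projection matrix

def lsh_sign(x):
    """Binarize a Gaussian projection; the sign bits hash angular locality,
    so Hamming distance between sketches estimates cosine dissimilarity."""
    return (x @ P) > 0

keys = rng.standard_normal((n, d))   # cached token keys
query = rng.standard_normal(d)       # incoming query token

key_bits = lsh_sign(keys)            # (n, r) binary sketches kept on GPU in the paper
query_bits = lsh_sign(query)         # (r,) sketch of the current query

# Hamming distance per cached token; for this LSH family,
# E[hamming / r] = theta / pi, where theta is the query-key angle.
hamming = (key_bits != query_bits).sum(axis=1)

# Evict the token most dissimilar to the query, i.e., the one
# expected to produce the lowest attention score.
evict_idx = int(hamming.argmax())
```

At decode time, the current token's key and value would overwrite slot `evict_idx` (and its sketch would replace `key_bits[evict_idx]`), keeping the cache at a fixed size without ever computing attention scores.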