🤖 AI Summary
To address the memory explosion of KV caches in large language model inference, this paper proposes a training-free, lightweight KV compression method grounded in the geometric structure of query (Q) and key (K) vectors. The core contribution is the identification of an intrinsic low-dimensional distribution of Q/K embeddings in their joint space, which enables attention to be approximated without explicitly computing attention scores. Leveraging this insight, the authors design a context-agnostic, projection-based filtering mechanism that dynamically retains the most salient KV pairs. The method is fully compatible with FlashAttention and supports both cache sparsification and attention score approximation. Experiments demonstrate: (1) 99% accuracy on needle-in-a-haystack retrieval under 32× KV compression; (2) up to a 65% reduction in the generation perplexity drop compared to Streaming-LLM for long-context text generation; and (3) long-context retrieval performance on par with SnapKV.
📝 Abstract
Autoregressive language models rely on a Key-Value (KV) Cache, which stores past hidden states so they are not recomputed during generation, substantially speeding up inference. As model sizes and context lengths grow, however, the KV Cache becomes a significant memory bottleneck, which calls for compression methods that limit its size during generation. In this paper, we discover surprising properties of Query (Q) and Key (K) vectors that allow us to efficiently approximate attention scores without computing the attention maps. We propose Q-Filters, a training-free KV Cache compression method that filters out less crucial Key-Value pairs based on a single context-agnostic projection. Unlike many alternatives, Q-Filters is compatible with FlashAttention, as it does not require direct access to attention weights. Experimental results in long-context settings demonstrate that Q-Filters is competitive with attention-based compression methods such as SnapKV on retrieval tasks while consistently outperforming efficient compression schemes such as Streaming-LLM in generation setups. Notably, Q-Filters achieves 99% accuracy on the needle-in-a-haystack task at a 32× compression level while reducing the generation perplexity drop by up to 65% in text generation compared to Streaming-LLM.
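The abstract describes scoring Key-Value pairs with a single context-agnostic projection rather than full attention maps. Below is a minimal, hypothetical NumPy sketch of that idea: it estimates a per-head filter direction as the dominant right singular vector of a sample of query vectors, then keeps only the keys whose projection onto that direction is largest. The function names (`estimate_q_filter`, `compress_kv`) and the sign-orientation heuristic are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def estimate_q_filter(Q):
    """Estimate a context-agnostic filter direction from sampled queries.

    Q: (num_queries, head_dim) array of query vectors for one head.
    Returns a unit vector u of shape (head_dim,).
    """
    # Dominant right singular vector captures the main direction
    # along which queries are distributed (illustrative choice).
    _, _, Vt = np.linalg.svd(Q, full_matrices=False)
    u = Vt[0]
    # Orient u so queries project positively on average, so that a
    # large <k, u> corresponds to a large expected attention logit.
    if np.mean(Q @ u) < 0:
        u = -u
    return u

def compress_kv(K, V, u, keep_ratio=1 / 32):
    """Keep only the KV pairs whose keys project most strongly onto u.

    K, V: (seq_len, head_dim) arrays; u: (head_dim,) filter direction.
    """
    scores = K @ u                              # proxy attention scores
    k = max(1, int(len(K) * keep_ratio))        # e.g. 32x compression
    idx = np.sort(np.argsort(scores)[-k:])      # top-k, positional order
    return K[idx], V[idx]
```

Because the filter direction is estimated once and reused for every incoming token, the selection never needs the attention matrix itself, which is what makes this style of scoring compatible with fused kernels like FlashAttention.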