🤖 AI Summary
To address the memory bottleneck caused by KV cache growth with sequence length in large Transformer models, this paper identifies a key limitation of existing low-rank compression methods: whether they compress only the Keys or jointly embed Queries and Keys, they are suboptimal for approximating the attention matrix. The authors propose KQ-SVD, the first method to perform an optimal low-rank approximation directly on the attention matrix (i.e., the QKᵀ product) via a closed-form singular value decomposition, rather than on the Query or Key matrices individually. Crucially, KQ-SVD rigorously preserves the inner-product structure between Queries and Keys, significantly improving the fidelity of compressed attention outputs. Experiments on LLaMA and Mistral demonstrate that, at equal compression ratios, KQ-SVD achieves substantially higher attention projection quality than state-of-the-art baselines, including SqueezeLLM and KVQuant, establishing a theoretically grounded and practically effective paradigm for efficient inference.
📝 Abstract
The Key-Value (KV) cache is central to the efficiency of transformer-based large language models (LLMs), storing previously computed vectors to accelerate inference. Yet, as sequence length and batch size grow, the cache becomes a major memory bottleneck. Prior compression methods typically apply low-rank decomposition to keys alone or attempt to jointly embed queries and keys, but both approaches neglect that attention fundamentally depends on their inner products. In this work, we prove that such strategies are suboptimal for approximating the attention matrix. We introduce KQ-SVD, a simple and computationally efficient method that directly performs an optimal low-rank decomposition of the attention matrix via a closed-form solution. By targeting the true source of redundancy, KQ-SVD preserves attention outputs with higher fidelity under compression. Extensive evaluations on LLaMA and Mistral models demonstrate that our approach consistently delivers superior projection quality.
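The gap between the two strategies can be sketched numerically: by the Eckart-Young theorem, the truncated SVD of the score matrix QKᵀ is its optimal rank-r approximation, so it can never do worse than any rank-r approximation obtained by compressing K alone. The NumPy toy below illustrates this ordering on random matrices; it is an illustrative sketch of the underlying linear algebra, not the paper's KQ-SVD implementation (all variable names and dimensions here are invented for the example).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 64, 32, 8  # sequence length, head dimension, target rank

Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
S = Q @ K.T  # pre-softmax attention score matrix

# Baseline: compress K alone to rank r via its own SVD, then form scores.
Uk, sk, Vkt = np.linalg.svd(K, full_matrices=False)
K_r = (Uk[:, :r] * sk[:r]) @ Vkt[:r]          # best rank-r approx of K
err_keys = np.linalg.norm(S - Q @ K_r.T, "fro")

# Direct approach: rank-r truncated SVD of the score matrix itself.
# Eckart-Young: S_r is the optimal rank-r approximation of S in
# Frobenius norm, so its error lower-bounds any rank-r alternative.
Us, ss, Vst = np.linalg.svd(S, full_matrices=False)
S_r = (Us[:, :r] * ss[:r]) @ Vst[:r]
err_scores = np.linalg.norm(S - S_r, "fro")

# Q @ K_r.T also has rank <= r, so the direct factorization wins.
assert err_scores <= err_keys
```

Note that the paper's contribution is obtaining such an attention-matrix factorization in closed form at cache-friendly cost; the sketch above only demonstrates why targeting QKᵀ directly is the right objective.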