Don't be so Stief! Learning KV Cache low-rank approximation over the Stiefel manifold

📅 2026-01-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the performance bottleneck in long-context inference caused by the substantial high-bandwidth memory (HBM) consumption of key-value (KV) caches. The authors propose StiefAttention, a post-training KV cache compression method that, for the first time, incorporates orthogonality constraints into low-rank approximation by optimizing end-to-end decoder output reconstruction error over the Stiefel manifold. The approach further enables adaptive per-layer rank allocation based on a user-specified error budget. Experimental results on Llama3-8B demonstrate that, at the same compression ratio, StiefAttention reduces C4 perplexity by 11.9 points and improves zero-shot MMLU accuracy by 5.4%, while achieving lower relative output error and higher cosine similarity compared to existing methods.

📝 Abstract
Key-value (KV) caching enables fast autoregressive decoding, but at long contexts it becomes a dominant bottleneck in High Bandwidth Memory (HBM) capacity and bandwidth. A common mitigation is to compress cached keys and values by projecting per-head matrices to a lower rank, storing only the projections in HBM. However, existing post-training approaches typically fit these projections using SVD-style proxy objectives, which may poorly reflect end-to-end reconstruction after softmax, value mixing, and subsequent decoder-layer transformations. To address this, we introduce StiefAttention, a post-training KV-cache compression method that learns \emph{orthonormal} projection bases by directly minimizing \emph{decoder-layer output reconstruction error}. StiefAttention additionally precomputes, for each layer, an error-rank profile over candidate ranks, enabling flexible layer-wise rank allocation under a user-specified error budget. Notably, on Llama3-8B under the same conditions, StiefAttention outperforms EigenAttention by $11.9$ points on C4 perplexity and $5.4\%$ on 0-shot MMLU accuracy at iso-compression, yielding lower relative error and higher cosine similarity with respect to the original decoder-layer outputs.
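A minimal NumPy sketch of the orthonormal low-rank projection idea described in the abstract. All sizes are illustrative, and the basis here is just a random orthonormal matrix obtained via QR; StiefAttention would instead learn the basis over the Stiefel manifold by minimizing decoder-layer output reconstruction error, which is not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes, not taken from the paper.
d_head, rank, seq_len = 64, 16, 128

# Hypothetical per-head key matrix (seq_len x d_head).
K = rng.standard_normal((seq_len, d_head))

# An orthonormal basis P (d_head x rank) with P.T @ P = I, i.e. a point on
# the Stiefel manifold St(d_head, rank). Here it is a random QR factor;
# the paper's method would learn it end-to-end.
P, _ = np.linalg.qr(rng.standard_normal((d_head, rank)))

# Only the rank-r projection is stored in HBM...
K_compressed = K @ P                  # (seq_len x rank)

# ...and keys are reconstructed on the fly when attention is computed.
K_reconstructed = K_compressed @ P.T  # (seq_len x d_head)

# Orthonormality makes reconstruction a plain transpose, no pseudo-inverse.
assert np.allclose(P.T @ P, np.eye(rank), atol=1e-8)

rel_err = np.linalg.norm(K - K_reconstructed) / np.linalg.norm(K)
print(f"relative reconstruction error at rank {rank}: {rel_err:.3f}")
```

The same projection is applied to the value cache; the memory saving per head is roughly `rank / d_head` (here 16/64 = 4x compression).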
Problem

Research questions and friction points this paper is trying to address.

KV cache compression
memory bottleneck
decoder-layer reconstruction
long-context inference
post-training compression
Innovation

Methods, ideas, or system contributions that make the work stand out.

KV cache compression
Stiefel manifold
orthonormal projection
post-training optimization
layer-wise rank allocation
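The layer-wise rank allocation listed above can be sketched as a simple lookup over precomputed error-rank profiles. The profile values below are made up for illustration; the paper would measure them per layer on calibration data, and the greedy "smallest rank within budget" rule is an assumption about how such a budget could be consumed.

```python
# Hypothetical error-rank profiles: for each layer, the precomputed relative
# reconstruction error at each candidate rank (values are illustrative only).
profiles = {
    0: {8: 0.30, 16: 0.12, 32: 0.04},
    1: {8: 0.20, 16: 0.06, 32: 0.02},
    2: {8: 0.45, 16: 0.25, 32: 0.10},
}

def allocate_ranks(profiles, error_budget):
    """Pick, per layer, the smallest candidate rank whose precomputed
    error stays within the user-specified budget."""
    allocation = {}
    for layer, errs in profiles.items():
        feasible = [r for r, e in sorted(errs.items()) if e <= error_budget]
        # Fall back to the largest candidate rank if none meets the budget.
        allocation[layer] = feasible[0] if feasible else max(errs)
    return allocation

print(allocate_ranks(profiles, error_budget=0.15))
# Layers that are harder to compress (here layer 2) receive a larger rank.
```

This is what makes the compression ratio adaptive per layer: a single scalar error budget translates into different ranks for different layers.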
🔎 Similar Papers
No similar papers found.
Luca Benfenati
Department of Control and Computer Engineering, Politecnico di Torino, Turin, Italy
Matteo Risso
PhD Student, Politecnico di Torino
Machine Learning, Embedded Systems, Design Automation, TinyML
Andrea Vannozzi
Department of Control and Computer Engineering, Politecnico di Torino, Turin, Italy
A. C. Yuzuguler
Huawei Zurich Research Center, Zurich, Switzerland
Lukas Cavigelli
Researcher (Expert/Architect), Huawei Technologies
Deep Learning, Computer Architecture, Circuits and Systems, VLSI, Signal Processing
Enrico Macii
Politecnico di Torino
Electronics, Computer Engineering
D. J. Pagliari
Department of Control and Computer Engineering, Politecnico di Torino, Turin, Italy
Alessio Burrello
Politecnico di Torino, University of Bologna
Machine Learning, Deep Learning, TinyML, Embedded Programming