ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration

📅 2025-05-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the excessive KV cache memory overhead in long-context reasoning with large language models (LLMs), and the accuracy degradation or extra computation that existing low-rank compression methods incur at high compression ratios, this paper proposes ReCalKV, a fine-tuning-free, post-training low-rank KV cache compression framework with distinct strategies for Keys and Values. The method comprises two core components: (1) Head-wise Similarity-aware Reordering (HSR), which clusters similar attention heads and applies grouped SVD to the key projection matrix, reducing extra computation while preserving accuracy; and (2) Offline Calibration and Matrix Fusion (OCMF), which preserves Value accuracy with zero online overhead by performing calibration offline and fusing the resulting matrices into existing projection weights. Because the approach relies only on SVD, head clustering, and grouped projection, it requires no online computation. Evaluated across multiple benchmarks, ReCalKV achieves a 50% reduction in the KV cache hidden dimension with an average performance drop of ≤0.3%, substantially outperforming prior low-rank compression techniques.

📝 Abstract
Large language models (LLMs) have achieved remarkable performance, yet their capability on long-context reasoning is often constrained by the excessive memory required to store the Key-Value (KV) cache. This makes KV cache compression an essential step toward enabling efficient long-context reasoning. Recent methods have explored reducing the hidden dimensions of the KV cache, but many introduce additional computation through projection layers or suffer from significant performance degradation under high compression ratios. To address these challenges, we propose ReCalKV, a post-training KV cache compression method that reduces the hidden dimensions of the KV cache. We develop distinct compression strategies for Keys and Values based on their different roles and varying importance in the attention mechanism. For Keys, we propose Head-wise Similarity-aware Reordering (HSR), which clusters similar heads and applies grouped SVD to the key projection matrix, reducing additional computation while preserving accuracy. For Values, we propose Offline Calibration and Matrix Fusion (OCMF) to preserve accuracy without extra computational overhead. Experiments show that ReCalKV outperforms existing low-rank compression methods, achieving high compression ratios with minimal performance loss. Code is available at: https://github.com/XIANGLONGYAN/ReCalKV.
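The Key-side idea in the abstract (cluster similar heads, then apply grouped SVD to the key projection so heads in a group share one low-rank basis) can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the sizes, the cosine-similarity measure over flattened head weights, and the greedy grouping heuristic are all assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; ReCalKV's actual models, group sizes, and ranks differ.
d_model, n_heads, d_head, group_size, rank = 64, 8, 16, 4, 8

# Key projection weight, split per head: W_k[h] maps d_model -> d_head.
W_k = rng.standard_normal((n_heads, d_head, d_model))

# 1) Head-wise similarity: cosine similarity of flattened head weights
#    (a plausible proxy; the paper's exact similarity measure may differ).
flat = W_k.reshape(n_heads, -1)
flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)
sim = flat @ flat.T

# 2) Reordering: greedily group each unassigned head with its most
#    similar unassigned peers, so each group is internally consistent.
unassigned = list(range(n_heads))
groups = []
while unassigned:
    seed = unassigned.pop(0)
    peers = sorted(unassigned, key=lambda h: -sim[seed, h])[: group_size - 1]
    for p in peers:
        unassigned.remove(p)
    groups.append([seed] + peers)

# 3) Grouped SVD: heads in a group share one low-rank basis, so the
#    cache stores `rank` numbers per token per group instead of
#    group_size * d_head.
compressed = []
for g in groups:
    stacked = np.concatenate([W_k[h] for h in g], axis=0)  # (g*d_head, d_model)
    U, S, Vt = np.linalg.svd(stacked, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # decode: rank -> group_size * d_head
    B = Vt[:rank]                # encode: d_model -> rank (what gets cached)
    compressed.append((g, A, B))

# Relative reconstruction error of each grouped low-rank factorization.
for g, A, B in compressed:
    stacked = np.concatenate([W_k[h] for h in g], axis=0)
    err = np.linalg.norm(A @ B - stacked) / np.linalg.norm(stacked)
    print(f"group {g}: relative error {err:.3f}")
```

The payoff of grouping similar heads is that their stacked projection matrix is closer to low rank than a stack of dissimilar heads, so the same `rank` budget loses less accuracy.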
Problem

Research questions and friction points this paper is trying to address.

Compress KV cache to reduce memory for long-context LLMs
Minimize performance loss under high compression ratios
Optimize Key and Value compression strategies separately
Innovation

Methods, ideas, or system contributions that make the work stand out.

Head-wise Similarity-aware Reordering (HSR) for Keys
Offline Calibration and Matrix Fusion (OCMF) for Values
Grouped SVD on the key projection matrix
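The Value-side innovation (offline calibration plus matrix fusion, so decompression costs no extra matmul at inference) can be sketched as below. This is a hedged illustration, not the paper's method: it assumes a whitening-style calibration (weighting the SVD by the activation covariance estimated from offline calibration data) and fuses the low-rank decoder into the attention output projection; the sizes and the single-head setup are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_head, rank, n_tokens = 64, 32, 12, 256

# Hypothetical single-head shapes; the paper works on full projections.
W_v = rng.standard_normal((d_model, d_head))  # value projection
W_o = rng.standard_normal((d_head, d_model))  # attention output projection
X = rng.standard_normal((n_tokens, d_model))  # offline calibration activations

# Offline calibration: weight the SVD by the activation covariance so the
# low-rank factors minimize error on realistic inputs rather than on the
# raw weights (a whitening-style objective, assumed for this sketch).
cov = X.T @ X
L = np.linalg.cholesky(cov + 1e-6 * np.eye(d_model))   # L @ L.T ~= cov
U, S, Vt = np.linalg.svd(L.T @ W_v, full_matrices=False)
enc = np.linalg.solve(L.T, U[:, :rank] * S[:rank])     # d_model -> rank
dec = Vt[:rank]                                        # rank -> d_head

# Matrix fusion: fold the decoder into the output projection offline, so
# reading the compressed cache adds no matmul at inference time. This is
# valid because attention mixes Values linearly before W_o.
W_v_compressed = enc          # produces the rank-dim cached values
W_o_fused = dec @ W_o         # (rank, d_model), replaces W_o

# Original path X @ W_v @ W_o is approximated by X @ W_v_compressed @
# W_o_fused: same number of matmuls, smaller cached tensors.
calibrated = np.linalg.norm(X @ (W_v - enc @ dec))
U2, S2, Vt2 = np.linalg.svd(W_v, full_matrices=False)
plain = np.linalg.norm(X @ (W_v - (U2[:, :rank] * S2[:rank]) @ Vt2[:rank]))
print(f"data-weighted error: calibrated {calibrated:.2f} vs plain SVD {plain:.2f}")
```

On the calibration distribution, the calibrated factorization is never worse than a plain truncated SVD of the weights, which is the point of doing calibration offline: the accuracy benefit is baked into the fused weights with no runtime cost.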