KVSculpt: KV Cache Compression as Distillation

📅 2026-03-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the excessive memory consumption of the KV cache in long-context reasoning with large language models. The proposed method directly optimizes lightweight, unconstrained KV pairs in continuous embedding space while preserving per-layer attention behavior. The core innovations are a reconstruction strategy that fully decouples from the original KV entries, an adaptive layer- and head-wise budget allocation mechanism guided by a cheap pre-compression pilot run, and an alternating optimization framework that refines key vectors with L-BFGS and solves for value vectors in closed form via least squares. Experiments on Qwen2.5-1.5B-Instruct demonstrate a 3.5–4.1× reduction in KL divergence over the Select+Fit baseline, with adaptive budget allocation contributing a further 1.3× improvement.
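The alternating scheme in the summary can be sketched as a toy single-head illustration. This is not the paper's implementation: the exact objective, initialization, and round count are assumptions, and SciPy's finite-difference gradients stand in for whatever gradient computation the authors use.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attn_out(Q, K, V):
    """Single-head scaled dot-product attention output."""
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def compress(Q, K, V, k, rounds=3):
    """Compress n KV pairs into k unconstrained pairs that preserve the
    layer's attention output on the query set Q (toy illustration)."""
    d = Q.shape[-1]
    target = attn_out(Q, K, V)                   # behavior to preserve
    Kc = rng.standard_normal((k, d)) * K.std()   # assumed init scheme
    Vc = np.zeros((k, V.shape[-1]))
    for _ in range(rounds):
        # Value step: with keys fixed, the attention weights A are fixed,
        # so min ||A Vc - target||^2 is an ordinary least-squares solve.
        A = softmax(Q @ Kc.T / np.sqrt(d))
        Vc = np.linalg.lstsq(A, target, rcond=None)[0]
        # Key step: refine keys with L-BFGS, values held fixed
        # (finite-difference gradients here; a real implementation
        # would use analytic gradients).
        def loss(flat, Vc=Vc):
            r = attn_out(Q, flat.reshape(k, d), Vc) - target
            return 0.5 * float(np.sum(r * r))
        Kc = minimize(loss, Kc.ravel(), method="L-BFGS-B",
                      options={"maxiter": 50}).x.reshape(k, d)
    return Kc, Vc
```

Note that because the compressed pairs are unconstrained, neither `Kc` nor `Vc` needs to coincide with any original cache entry, which is the key difference from eviction or merging.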
📝 Abstract
KV cache compression is critical for efficient long-context LLM inference. Approaches that reduce the per-pair footprint -- quantization and low-rank decomposition -- are orthogonal to those that reduce the sequence length of the cache. Along the sequence-length dimension, existing methods range from pure eviction -- selecting which KV pairs to keep -- to merging, which combines similar pairs into fewer ones. Both remain anchored to the original cache entries. We propose KVSculpt, which moves to the other end of this spectrum: instead of selecting or combining original pairs, we optimize a smaller set of unconstrained KV pairs in continuous embedding space to preserve each layer's attention behavior. Keys are optimized via L-BFGS and values are solved in closed form via least squares, alternating every few steps. On top of this, we introduce adaptive budget allocation, which uses a cheap pilot compression run to redistribute the compression budget across layers and KV heads based on per-component difficulty. On Qwen2.5-1.5B-Instruct with 2048-token contexts, KVSculpt reduces KL divergence by 3.5-4.1x compared to Select+Fit -- attention-score eviction with least-squares value fitting -- across compression ratios r in {0.3, 0.5, 0.7}. Adaptive allocation provides an additional 1.3x KL reduction at no extra inference cost. Analysis reveals that compression difficulty is highly non-uniform: per-layer pilot MSE varies by up to 100x across layers, and the two KV heads within a single layer can differ by up to 467x -- demonstrating that fine-grained budget allocation is essential.
Problem

Research questions and friction points this paper is trying to address.

- KV cache compression
- long-context LLM inference
- attention behavior preservation
- compression budget allocation
Innovation

Methods, ideas, or system contributions that make the work stand out.

- KV cache compression
- continuous optimization
- adaptive budget allocation
- attention preservation
- L-BFGS