🤖 AI Summary
This work addresses the excessive memory consumption of the KV cache in long-context reasoning with large language models by proposing a method that directly optimizes lightweight, unconstrained KV pairs in the continuous embedding space while preserving per-layer attention behavior. The core innovations include a reconstruction strategy that fully decouples the compressed cache from the original KV entries, an adaptive layer- and head-wise budget allocation mechanism guided by a pre-compression pilot evaluation, and an alternating optimization framework that employs L-BFGS to refine key vectors and least squares to solve for value vectors. Experiments on Qwen2.5-1.5B-Instruct demonstrate a 3.5–4.1× reduction in KL divergence compared to the Select+Fit baseline, with adaptive budget allocation yielding an additional 1.3× improvement.
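The alternating scheme described above can be sketched in a few lines of NumPy/SciPy. This is a minimal toy reconstruction, not the paper's implementation: the objective (MSE between the compressed and original attention outputs for a shared query set), the subset-based initialization, and all sizes are assumptions made for illustration. Keys are refined with L-BFGS (finite-difference gradients here, for brevity), then values are solved in closed form by least squares given the resulting attention weights.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attn_out(Q, K, V):
    # single-head scaled dot-product attention output
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

# toy sizes (assumed): m queries, n original KV pairs, k compressed pairs, dim d
m, n, k, d = 16, 32, 8, 4
Q = rng.standard_normal((m, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
O = attn_out(Q, K, V)  # per-layer attention behavior to preserve

# initialize the compressed cache from a random subset of original entries
idx = rng.choice(n, size=k, replace=False)
K_c, V_c = K[idx].copy(), V[idx].copy()

def key_loss(flat_K, V_fixed):
    return np.sum((attn_out(Q, flat_K.reshape(k, d), V_fixed) - O) ** 2)

for _ in range(5):  # alternate keys <-> values a few rounds
    # keys: L-BFGS on the attention-output MSE
    res = minimize(key_loss, K_c.ravel(), args=(V_c,),
                   method="L-BFGS-B", options={"maxiter": 25})
    K_c = res.x.reshape(k, d)
    # values: closed-form least squares given the new attention weights
    A = softmax(Q @ K_c.T / np.sqrt(d))
    V_c, *_ = np.linalg.lstsq(A, O, rcond=None)

final = np.sum((attn_out(Q, K_c, V_c) - O) ** 2)
print(f"final MSE: {final:.4f}")
```

Because the compressed keys and values are free parameters in embedding space rather than selections from the original cache, the optimizer can place them anywhere that best reproduces the layer's attention outputs.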
📝 Abstract
KV cache compression is critical for efficient long-context LLM inference. Approaches that reduce the per-pair footprint -- quantization and low-rank decomposition -- are orthogonal to those that reduce the sequence length of the cache. Along the sequence-length dimension, existing methods range from pure eviction -- selecting which KV pairs to keep -- to merging, which combines similar pairs into fewer ones. Both remain anchored to the original cache entries. We propose KVSculpt, which moves to the other end of this spectrum: instead of selecting or combining original pairs, we optimize a smaller set of unconstrained KV pairs in continuous embedding space to preserve each layer's attention behavior. Keys are optimized via L-BFGS and values are solved in closed form via least squares, alternating every few steps. On top of this, we introduce adaptive budget allocation, which uses a cheap pilot compression run to redistribute the compression budget across layers and KV heads based on per-component difficulty. On Qwen2.5-1.5B-Instruct with 2048-token contexts, KVSculpt reduces KL divergence by 3.5-4.1x compared to Select+Fit -- attention-score eviction with least-squares value fitting -- across compression ratios r in {0.3, 0.5, 0.7}. Adaptive allocation provides an additional 1.3x KL reduction at no extra inference cost. Analysis reveals that compression difficulty is highly non-uniform: per-layer pilot MSE varies by up to 100x across layers, and the two KV heads within a single layer can differ by up to 467x -- demonstrating that fine-grained budget allocation is essential.
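Given the 100x per-layer and 467x per-head spread in pilot MSE reported above, a natural allocation rule is to split the total budget across layer-head components in proportion to their pilot difficulty. The abstract only says allocation is guided by a cheap pilot run, so the proportional-with-floor scheme below is an assumption, sketched for illustration:

```python
import numpy as np

def allocate_budget(pilot_mse, total_budget, floor=1):
    """Split a total KV budget across components (e.g. layer-head pairs)
    in proportion to each component's pilot-compression MSE.
    The proportional rule and the per-component floor are assumptions."""
    pilot_mse = np.asarray(pilot_mse, dtype=float)
    n = len(pilot_mse)
    weights = pilot_mse / pilot_mse.sum()
    # guarantee every component at least `floor` slots, share the rest by weight
    raw = floor + weights * (total_budget - floor * n)
    alloc = np.floor(raw).astype(int)
    # hand leftover slots to the largest fractional remainders
    leftover = total_budget - alloc.sum()
    order = np.argsort(raw - alloc)[::-1]
    alloc[order[:leftover]] += 1
    return alloc

# e.g. pilot MSEs spanning two orders of magnitude across four components
mse = [0.01, 0.2, 0.5, 1.0]
budget = allocate_budget(mse, total_budget=100)
print(budget, budget.sum())
```

Harder components (larger pilot MSE) receive more KV pairs, easier ones fewer, while the total stays fixed, so the redistribution adds no inference cost.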