🤖 AI Summary
This work addresses the high memory and computational overhead of key-value (KV) caching in large language models, which severely hinders deployment efficiency. For the first time, it theoretically shows that spectral concentration in Query/Key projection weights induces feature homogenization, while spectral dispersion in Value projection weights preserves heterogeneity. Building on this insight, the paper proposes KVSlimmer, a gradient-free, closed-form KV compression method that relies solely on forward-pass variables and captures exact Hessian information, enabling highly efficient asymmetric compression. On Llama3.1-8B-Instruct, the method improves the average LongBench score by 0.92 while reducing memory usage by 29% and inference latency by 28%, outperforming state-of-the-art baselines.
📝 Abstract
The growing computational and memory demands of the Key-Value (KV) cache significantly limit the practical deployment of Large Language Models (LLMs). While KV merging has emerged as a promising solution, existing methods rely on empirical observations of KV asymmetry and on gradient-based Hessian approximations; they lack a theoretical foundation and suffer from suboptimal compression and extra inference overhead. To bridge these gaps, we establish a theoretical framework that characterizes this asymmetry through the spectral energy distribution of projection weights, demonstrating that concentrated spectra in Query/Key weights induce feature homogeneity, whereas dispersed spectra in Value weights preserve heterogeneity. We then introduce KVSlimmer, an efficient algorithm that captures exact Hessian information through a mathematically exact formulation and derives a closed-form solution using only forward-pass variables, yielding a gradient-free approach that is both memory- and time-efficient. Extensive experiments across various models and benchmarks demonstrate that KVSlimmer consistently outperforms SOTA methods. For instance, on Llama3.1-8B-Instruct, it improves the LongBench average score by 0.92 while reducing memory costs and latency by 29% and 28%, respectively.
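The spectral-asymmetry claim above can be made concrete with a small numerical sketch. The snippet below (an illustration, not the paper's actual analysis; the matrices and the `top_k` cutoff are hypothetical stand-ins for Q/K vs. V projection weights) measures how much spectral energy, i.e. the sum of squared singular values, is captured by the leading singular values of a weight matrix:

```python
import numpy as np

def spectral_energy_concentration(W, top_k=8):
    """Fraction of total spectral energy (sum of squared singular
    values) captured by the top_k singular values of W."""
    s = np.linalg.svd(W, compute_uv=False)
    energy = s ** 2
    return energy[:top_k].sum() / energy.sum()

rng = np.random.default_rng(0)
d = 256
# Toy stand-ins: a low-rank-dominated matrix (concentrated spectrum,
# the behavior the paper attributes to Q/K weights) vs. a
# near-isotropic Gaussian one (dispersed spectrum, as for V weights).
concentrated = (rng.normal(size=(d, 4)) @ rng.normal(size=(4, d))
                + 0.05 * rng.normal(size=(d, d)))
dispersed = rng.normal(size=(d, d))

print(spectral_energy_concentration(concentrated))  # near 1.0
print(spectral_energy_concentration(dispersed))     # far below 1.0
```

Under this toy setup, outputs of the concentrated matrix collapse toward a few dominant directions (feature homogenization), while the dispersed matrix spreads energy across many directions, which is the intuition behind compressing Keys and Values asymmetrically.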