🤖 AI Summary
This work addresses the high memory and computational overhead of key-value (KV) caching in large language models, which severely hinders deployment efficiency. For the first time, it theoretically shows that spectral concentration in Query/Key projection weights induces feature homogenization, while spectral dispersion in Value projection weights preserves heterogeneity. Building on this insight, the paper proposes KVSlimmer, a gradient-free, closed-form KV compression method that relies solely on forward-pass variables and captures exact Hessian information, enabling highly efficient asymmetric compression. On Llama3.1-8B-Instruct, the method improves the average LongBench score by 0.92 while reducing memory usage by 29% and inference latency by 28%, outperforming state-of-the-art baselines.
📝 Abstract
The growing computational and memory demands of the Key-Value (KV) cache significantly limit the practical deployment of Large Language Models (LLMs). While KV merging has emerged as a promising solution, existing methods rely on empirical observations of KV asymmetry and on gradient-based Hessian approximations; they lack a theoretical foundation and suffer from suboptimal compression and extra inference overhead. To bridge these gaps, we establish a theoretical framework that characterizes this asymmetry through the spectral energy distribution of projection weights, demonstrating that concentrated spectra in Query/Key weights induce feature homogeneity, whereas dispersed spectra in Value weights preserve heterogeneity. We then introduce KVSlimmer, an efficient algorithm that captures exact Hessian information through a mathematically exact formulation and derives a closed-form solution using only forward-pass variables, yielding a gradient-free approach that is both memory- and time-efficient. Extensive experiments across various models and benchmarks demonstrate that KVSlimmer consistently outperforms SOTA methods. For instance, on Llama3.1-8B-Instruct, it improves the LongBench average score by 0.92 while reducing memory costs and latency by 29% and 28%, respectively.
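The spectral-asymmetry claim above can be made concrete with a small numerical sketch. The snippet below (an illustration, not the paper's actual analysis; the matrices and the `top_k` cutoff are hypothetical stand-ins for Q/K vs. V projection weights) measures how much spectral energy, i.e. the sum of squared singular values, is captured by the leading singular values of a weight matrix:

```python
import numpy as np

def spectral_energy_concentration(W, top_k=8):
    """Fraction of total spectral energy (sum of squared singular
    values) captured by the top_k singular values of W."""
    s = np.linalg.svd(W, compute_uv=False)
    energy = s ** 2
    return energy[:top_k].sum() / energy.sum()

rng = np.random.default_rng(0)
d = 256
# Toy stand-ins: a low-rank-dominated matrix (concentrated spectrum,
# the behavior the paper attributes to Q/K weights) vs. a
# near-isotropic Gaussian one (dispersed spectrum, as for V weights).
concentrated = (rng.normal(size=(d, 4)) @ rng.normal(size=(4, d))
                + 0.05 * rng.normal(size=(d, d)))
dispersed = rng.normal(size=(d, d))

print(spectral_energy_concentration(concentrated))  # near 1.0
print(spectral_energy_concentration(dispersed))     # far below 1.0
```

Under this toy setup, outputs of the concentrated matrix collapse toward a few dominant directions (feature homogenization), while the dispersed matrix spreads energy across many directions, which is the intuition behind compressing Keys and Values asymmetrically.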