KVSlimmer: Theoretical Insights and Practical Optimizations for Asymmetric KV Merging

📅 2026-02-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the high memory and computational overhead of key-value (KV) caching in large language models, which severely hinders deployment efficiency. It provides the first theoretical account of KV asymmetry, showing that spectral concentration in Query/Key weights induces feature homogenization, while spectral dispersion in Value weights preserves heterogeneity. Building on this insight, the paper proposes a gradient-free, closed-form KV compression method that relies solely on forward-pass variables and captures exact Hessian information, enabling highly efficient asymmetric compression. Evaluated on Llama3.1-8B-Instruct, the method improves the average LongBench score by 0.92 while reducing memory usage by 29% and inference latency by 28%, outperforming existing state-of-the-art techniques.
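
The spectral claim is easy to probe empirically. Below is a minimal sketch, assuming PyTorch and Llama-style `q_proj`/`k_proj`/`v_proj` attribute names; the top-k energy ratio is an illustrative proxy for the paper's spectral analysis, not its exact metric.

```python
# Sketch: measuring spectral energy concentration of attention projection
# weights. A ratio near 1 means a concentrated spectrum (the paper links
# this to feature homogenization in Query/Key); a lower ratio means a
# dispersed spectrum (heterogeneity preserved in Value).
import torch

def spectral_energy_ratio(weight: torch.Tensor, k: int = 32) -> float:
    """Fraction of total spectral energy carried by the top-k singular values."""
    s = torch.linalg.svdvals(weight.float())  # singular values, descending order
    energy = s.square()                       # spectral energy per component
    return (energy[:k].sum() / energy.sum()).item()

# Hypothetical usage on one attention layer (attribute names assumed):
# r_q = spectral_energy_ratio(layer.self_attn.q_proj.weight)
# r_k = spectral_energy_ratio(layer.self_attn.k_proj.weight)
# r_v = spectral_energy_ratio(layer.self_attn.v_proj.weight)
# Expectation under the paper's claim: r_q, r_k noticeably higher than r_v.
```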

📝 Abstract
The growing computational and memory demands of the Key-Value (KV) cache significantly limit the practical deployment of Large Language Models (LLMs). While KV merging has emerged as a promising solution, existing methods rely on empirical observations of KV asymmetry and on gradient-based Hessian approximations; they lack a theoretical foundation and incur suboptimal compression quality and extra inference overhead. To bridge these gaps, we establish a theoretical framework that characterizes this asymmetry through the spectral energy distribution of projection weights, demonstrating that concentrated spectra in Query/Key weights induce feature homogeneity, whereas dispersed spectra in Value weights preserve heterogeneity. We then introduce KVSlimmer, an efficient algorithm that captures Hessian information through a mathematically exact formulation and derives a closed-form solution using only forward-pass variables, yielding a gradient-free approach that is both memory- and time-efficient. Extensive experiments across various models and benchmarks demonstrate that KVSlimmer consistently outperforms SOTA methods. For instance, on Llama3.1-8B-Instruct, it improves the LongBench average score by 0.92 while reducing memory costs and latency by 29% and 28%, respectively.
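
To make the "gradient-free, closed-form" idea concrete, here is a minimal sketch of one plausible reading: a group of cached value vectors is merged by weighted least squares, with weights taken from forward-pass attention mass acting as a diagonal stand-in for Hessian information. The function `merge_values` and its weighting scheme are an illustrative reconstruction, not KVSlimmer's actual update rule.

```python
# Sketch of a gradient-free, closed-form KV merge: a group of cached value
# vectors is collapsed into one representative using only forward-pass
# statistics (no backward pass). Illustrative stand-in for the paper's
# exact Hessian-based solution.
import torch

def merge_values(values: torch.Tensor, attn_mass: torch.Tensor) -> torch.Tensor:
    """Closed-form merge of a token group's cached values.

    values:    (g, d) value vectors for a group of g mergeable tokens
    attn_mass: (g,)   total attention each token has received (forward pass only)

    Minimizes sum_i attn_mass[i] * ||values[i] - v||^2 over v; the minimizer
    is the attention-weighted mean (a diagonal-Hessian special case).
    """
    w = attn_mass / attn_mass.sum()            # normalize weights to sum to 1
    return (w.unsqueeze(-1) * values).sum(dim=0)  # weighted mean, shape (d,)
```

The weighted mean is the exact minimizer of the stated quadratic objective, which is what makes such a merge closed-form: no gradient steps are needed, only quantities already produced during the forward pass.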
Problem

Research questions and friction points this paper is trying to address.

KV cache
Large Language Models
KV merging
memory efficiency
inference overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

KV merging
spectral energy distribution
exact Hessian formulation
gradient-free optimization
LLM inference efficiency
Lianjun Liu
School of Information and Communication Engineering, Hainan University, Haikou, China
Hongli An
School of Cyberspace Security, Hainan University, Haikou, China
Weiqi Yan
MAC Lab, Department of Artificial Intelligence, School of Informatics, Xiamen University, Xiamen, China
Xin Du
Hainan University
Exceptional Model Mining, Trustworthy Machine Learning, Causal Inference, Spatio-Temporal Data Mining
Shengchuan Zhang
Xiamen University
computer vision, machine learning
Huazhong Liu
Huazhong University of Science and Technology
computer science, big data, high performance computing
Yunshan Zhong
Hainan University