Homogeneous Keys, Heterogeneous Values: Exploiting Local KV Cache Asymmetry for Long-Context LLMs

📅 2025-06-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing long-context LLM inference suffers from prohibitive attention computation costs and suboptimal KV cache compression, largely because existing methods ignore the asymmetry between key homogeneity and value heterogeneity. Method: This paper first identifies the local homogeneity–heterogeneity structural property of KV caches and proposes a training-free, decoupled compression framework: (i) locality-aware key clustering and merging based on similarity, and (ii) information-preserving value-space projection compression with mathematically guaranteed lossless reconstruction. Contribution/Results: Evaluated on LongBench with LLaMA3.1-8B, the method achieves a mean score of 43.95, significantly surpassing state-of-the-art approaches such as H₂O (38.89). It enables efficient inference over context lengths exceeding 10,000 tokens, establishing a theoretically rigorous and deployment-friendly paradigm for long-context optimization.
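The first component, locality-aware key clustering, can be illustrated with a minimal sketch: greedily scan adjacent keys and fold each into the running cluster centroid when cosine similarity exceeds a threshold. The function name, greedy strategy, and threshold are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def cluster_adjacent_keys(keys, threshold=0.9):
    """Greedy locality-aware clustering of adjacent key vectors.

    Adjacent keys whose cosine similarity to the running cluster
    centroid exceeds `threshold` are merged into that centroid.
    This is a simplified sketch of the idea, not AsymKV's method.
    """
    centroids = [keys[0].astype(float)]
    counts = [1]  # how many keys each centroid summarizes
    for k in keys[1:]:
        c = centroids[-1]
        sim = k @ c / (np.linalg.norm(k) * np.linalg.norm(c) + 1e-9)
        if sim > threshold:
            # Update the running mean of the current cluster.
            centroids[-1] = (c * counts[-1] + k) / (counts[-1] + 1)
            counts[-1] += 1
        else:
            centroids.append(k.astype(float))
            counts.append(1)
    return np.stack(centroids), np.array(counts)
```

With a threshold of 0.9, two nearly parallel adjacent keys collapse into one centroid while a near-orthogonal key opens a new cluster, so a cache of four keys in two directions reduces to two centroids.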

📝 Abstract
Recent advances in Large Language Models (LLMs) have highlighted the critical importance of extending context length, yet the quadratic complexity of attention mechanisms poses significant challenges for efficient long-context modeling. KV cache compression has emerged as a key approach to address this challenge. Through extensive empirical analysis, we reveal a fundamental yet previously overlooked asymmetry in KV caches: while adjacent keys receive similar attention weights (local homogeneity), adjacent values demonstrate distinct heterogeneous distributions. This key-value asymmetry reveals a critical limitation in existing compression methods that treat keys and values uniformly. To address the limitation, we propose a training-free compression framework (AsymKV) that combines homogeneity-based key merging with a mathematically proven lossless value compression. Extensive experiments demonstrate that AsymKV consistently outperforms existing long-context methods across various tasks and base models. For example, on LLaMA3.1-8B, AsymKV achieves an average score of 43.95 on LongBench, surpassing SOTA methods like H$_2$O (38.89) by a large margin.
Problem

Research questions and friction points this paper is trying to address.

Address quadratic complexity in long-context LLM attention
Exploit KV cache asymmetry for efficient compression
Propose training-free compression for heterogeneous values
Innovation

Methods, ideas, or system contributions that make the work stand out.

Exploits KV cache asymmetry for compression
Uses homogeneity-based key merging
Implements lossless value compression
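The lossless-value-compression claim can be sketched as a subspace projection: if the value matrix has rank at most r, projecting onto its top-r right singular directions and reconstructing recovers it exactly. The SVD-based construction below is an assumed illustration of "information-preserving projection", not the paper's specific procedure.

```python
import numpy as np

def compress_values(V, rank):
    """Project value matrix V (n x d) onto its top-`rank` right
    singular directions. Reconstruction is exact whenever
    rank >= rank(V); otherwise it is the best rank-`rank`
    approximation in Frobenius norm."""
    _, _, Vt = np.linalg.svd(V, full_matrices=False)
    basis = Vt[:rank]        # (rank, d) orthonormal rows
    coeffs = V @ basis.T     # (n, rank) compressed representation
    return coeffs, basis

def reconstruct_values(coeffs, basis):
    # Map compressed coefficients back to the original value space.
    return coeffs @ basis
```

For values that genuinely lie in a low-dimensional subspace, this stores n x rank coefficients plus a rank x d basis instead of the full n x d matrix, while reconstruction is numerically exact.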