AI Summary
This work addresses the prohibitive memory cost of key-value (KV) caching in long-context scenarios, where the KV cache grows linearly with input length and severely hinders large language model deployment. To overcome this limitation, the authors propose MixedDimKV, which departs from conventional coarse-grained, token-level compression by, for the first time, allocating the KV cache at the level of individual dimensions within each token. They further introduce MixedDimKV-H, which jointly models attention-head importance and per-dimension resource allocation to improve memory efficiency. The method achieves performance comparable to full attention on LongBench using only 6.25% of the original KV cache and maintains 100% accuracy on the 50K-context Needle-in-a-Haystack task while retaining merely 0.26% of the cache.
Abstract
Key-value (KV) caching is widely used to accelerate transformer inference, but its memory cost grows linearly with input length, limiting long-context deployment. Existing token eviction methods reduce memory by discarding less important tokens, which can be viewed as a coarse form of dimensionality reduction that assigns each token either zero dimensions or the full dimension. We propose MixedDimKV, a mixed-dimension KV cache compression method that allocates dimensions to tokens at a finer granularity, and MixedDimKV-H, which further integrates head-level importance information. Experiments on long-context benchmarks show that MixedDimKV outperforms prior KV cache compression methods that do not rely on head-level importance profiling. When equipped with the same head-level importance information, MixedDimKV-H consistently outperforms HeadKV. Notably, our approach achieves performance comparable to full attention on LongBench with only 6.25% of the KV cache. Furthermore, in the Needle-in-a-Haystack test, our solution maintains 100% accuracy at a 50K context length while using as little as 0.26% of the cache.
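To make the core idea concrete, the sketch below illustrates mixed-dimension allocation for a single attention head: instead of keeping or evicting whole tokens, each token is granted a number of cached dimensions that scales with its importance. The function name, the proportional-to-score allocation rule, and the largest-magnitude dimension selection are illustrative assumptions, not details specified in the abstract; they only show how a per-token dimension budget generalizes token eviction (the zero-or-all special case).

```python
import numpy as np

def mixed_dim_allocate(keys, token_scores, budget_ratio=0.0625, min_dims=0):
    """Toy sketch of mixed-dimension KV compression (illustrative only).

    keys:         (num_tokens, head_dim) cached key vectors for one head
    token_scores: (num_tokens,) importance scores, e.g. accumulated attention
    budget_ratio: fraction of the full cache to keep (e.g. 6.25%)

    Returns, for each token, the indices of the dimensions it keeps. Tokens
    with higher scores receive more dimensions; a token may keep zero, which
    degenerates to ordinary token eviction.
    """
    num_tokens, head_dim = keys.shape
    total_budget = int(budget_ratio * num_tokens * head_dim)

    # Assumed allocation rule: dimensions proportional to normalized score.
    probs = token_scores / token_scores.sum()
    dims_per_token = np.floor(probs * total_budget).astype(int)
    dims_per_token = np.clip(dims_per_token, min_dims, head_dim)

    kept = []
    for t in range(num_tokens):
        d = dims_per_token[t]
        # Assumed selection rule: keep the largest-magnitude dimensions.
        idx = np.argsort(-np.abs(keys[t]))[:d]
        kept.append(np.sort(idx))
    return kept

# Example: 8 tokens with head_dim 16; a large budget is used for visibility.
rng = np.random.default_rng(0)
keys = rng.standard_normal((8, 16))
scores = rng.random(8)
for t, idx in enumerate(mixed_dim_allocate(keys, scores, budget_ratio=0.5)):
    print(f"token {t}: keep {len(idx)} of 16 dims")
```

A head-aware variant in the spirit of MixedDimKV-H could additionally scale each head's `budget_ratio` by a head-importance score, so that unimportant heads contribute even fewer dimensions to the overall budget; the abstract does not specify how that weighting is computed.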