🤖 AI Summary
To address the excessive KV cache memory overhead in long-context generation by large language models (LLMs), this paper proposes a geometric sampling compression method grounded in Banaszczyk's vector balancing theory. It is the first work to introduce vector balancing theory into KV cache compression, explicitly modeling geometric dependencies among key-value pairs to enable high-fidelity cache pruning. The method integrates geometric sampling, low-rank approximation, and rigorous error control, yielding a theoretically tighter reconstruction error bound. Experiments demonstrate that the approach reduces memory consumption by up to 58% on long-context tasks while preserving, or even improving, generation quality, and that it consistently outperforms state-of-the-art baselines, including StreamingLLM and FlashAttention-2, across diverse benchmarks.
📝 Abstract
Large language models (LLMs) have achieved impressive success, but their high memory requirements present challenges for long-context token generation. The memory footprint of long-context LLMs is dominated by the need to store Key-Value (KV) embeddings in the KV cache. We present BalanceKV, a KV cache compression method based on a geometric sampling process stemming from Banaszczyk's vector balancing theory, which introduces dependencies informed by the geometry of the key and value tokens and improves precision. BalanceKV offers both theoretically proven and empirically validated performance improvements over existing methods.
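To make the vector-balancing idea concrete, the sketch below halves a KV cache by assigning ±1 signs to tokens with a self-balancing random walk, so the signed sum of key vectors stays small, and then keeping only the +1 tokens. This is a hedged illustration in the spirit of discrepancy-based sampling (the sign-update rule follows the Alweiss-Liu-Sawhney style balancing step), not the actual BalanceKV algorithm; the function name `balanced_halving` and the scale constant `c` are assumptions for the example.

```python
import numpy as np

def balanced_halving(keys, values, rng=None):
    """Halve a KV cache with a self-balancing signed walk.

    Assigns +1/-1 signs to tokens so the running signed sum of key
    vectors stays small (a vector-balancing/discrepancy idea), then
    keeps only the +1 tokens. Hypothetical sketch, not BalanceKV.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = keys.shape
    w = np.zeros(d)                  # running signed sum of keys
    signs = np.empty(n, dtype=int)
    c = 30.0                         # scale constant; assumes ||k_i|| = O(1)
    for i in range(n):
        k = keys[i]
        # Bias the sign against the current drift so that <w, k>
        # shrinks in expectation, keeping the walk balanced.
        p = 0.5 - np.dot(w, k) / (2.0 * c)
        p = min(max(p, 0.0), 1.0)    # clip to a valid probability
        signs[i] = 1 if rng.random() < p else -1
        w += signs[i] * k
    keep = signs == 1
    return keys[keep], values[keep], keep
```

Each call roughly halves the cache; applying it recursively trades memory for reconstruction error, which is the regime where Banaszczyk-type bounds give tighter guarantees than independent uniform sampling.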