🤖 AI Summary
To address the excessive KV cache memory overhead in long-context generation by large language models (LLMs), this paper proposes a geometric sampling compression method grounded in Banaszczyk's vector balancing theory. It is the first work to introduce vector balancing theory into KV cache compression, explicitly modeling geometric dependencies among key-value pairs to enable high-fidelity cache pruning. The method integrates geometric sampling, low-rank approximation, and rigorous error control, yielding a theoretically tighter reconstruction error bound. Experiments demonstrate that the approach reduces memory consumption by up to 58% on long-context tasks while preserving, or even improving, generation quality, and that it consistently outperforms state-of-the-art baselines, including StreamingLLM and FlashAttention-2, across diverse benchmarks.
📝 Abstract
Large language models (LLMs) have achieved impressive success, but their high memory requirements present challenges for long-context token generation. The memory footprint of long-context LLMs is dominated by the need to store Key-Value (KV) embeddings in the KV cache. We present BalanceKV, a KV cache compression method based on a geometric sampling process stemming from Banaszczyk's vector balancing theory, which introduces dependencies informed by the geometry of the key and value tokens and improves precision. BalanceKV offers both theoretically proven and empirically validated performance improvements over existing methods.
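To make the vector-balancing idea concrete, the sketch below halves a KV cache by assigning ±1 signs to tokens with a self-balancing random walk, so the signed sum of key vectors stays small, and then keeping only the +1 tokens. This is a hedged illustration in the spirit of discrepancy-based sampling (the sign-update rule follows the Alweiss-Liu-Sawhney style balancing step), not the actual BalanceKV algorithm; the function name `balanced_halving` and the scale constant `c` are assumptions for the example.

```python
import numpy as np

def balanced_halving(keys, values, rng=None):
    """Halve a KV cache with a self-balancing signed walk.

    Assigns +1/-1 signs to tokens so the running signed sum of key
    vectors stays small (a vector-balancing/discrepancy idea), then
    keeps only the +1 tokens. Hypothetical sketch, not BalanceKV.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = keys.shape
    w = np.zeros(d)                  # running signed sum of keys
    signs = np.empty(n, dtype=int)
    c = 30.0                         # scale constant; assumes ||k_i|| = O(1)
    for i in range(n):
        k = keys[i]
        # Bias the sign against the current drift so that <w, k>
        # shrinks in expectation, keeping the walk balanced.
        p = 0.5 - np.dot(w, k) / (2.0 * c)
        p = min(max(p, 0.0), 1.0)    # clip to a valid probability
        signs[i] = 1 if rng.random() < p else -1
        w += signs[i] * k
    keep = signs == 1
    return keys[keep], values[keep], keep
```

Each call roughly halves the cache; applying it recursively trades memory for reconstruction error, which is the regime where Banaszczyk-type bounds give tighter guarantees than independent uniform sampling.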