🤖 AI Summary
This work addresses the linear growth of KV cache memory and decode-time access cost with context length in long-context language models. Post-hoc compression methods operate on the cache of a fixed pretrained model, so their effectiveness is bounded by how compressible that model's learned representations are. The study formalizes KV compressibility as a property of the learned representations rather than of the context alone, and introduces KV Compression-Aware Training (KV-CAT), a continued-pretraining procedure that masks KV slots during training so that the model learns to rely on fewer slots. By shaping representations at the source, KV-CAT makes the model more amenable to downstream post-hoc compression and improves the quality-budget trade-off of those methods on retrieval, long-context question answering, and perplexity-based evaluation of compressed-prefix continuation.
📝 Abstract
Long-context language modeling is increasingly constrained by the Key-Value (KV) cache, whose memory and decode-time access costs scale linearly with the prefix length. This bottleneck has motivated a range of context-compression methods, from token-level summarization to recent optimization-based KV compression methods. These post-hoc methods operate on the KV cache of a fixed pretrained model, so their effectiveness is fundamentally limited by how well the model's internal representations can be compressed. In this work, we formalize the notion of KV compressibility and show that it is a property of the learned representations, rather than of the context alone. We prove that almost any sequence-to-vector function admits both highly compressible and inherently non-compressible transformer implementations, highlighting the need to guide transformers toward compressible representations during training. Motivated by this, we propose KV-Compression Aware Training (KV-CAT), a continued pretraining procedure that incentivizes the emergence of compressible representations. We introduce a train-time KV sparsification policy that masks KV slots during training. This forces the model to use fewer KV slots and encourages it to learn representations amenable to post-hoc compression. Empirically, we show that KV-CAT improves the quality-budget tradeoff of downstream compression methods across retrieval, long-context question answering, and perplexity-based evaluation of compressed-prefix continuation.
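The idea of masking KV slots at train time can be illustrated with a minimal sketch. The abstract does not specify the sparsification policy, so the example below assumes the simplest possible one: each slot is kept independently with probability `keep_prob`, and the most recent slot is always kept. The names `sample_keep_mask` and `attention_with_kv_mask` are hypothetical and are not taken from the paper; the code shows only single-query, single-head attention in plain Python.

```python
import math
import random


def sample_keep_mask(rng, n_slots, keep_prob):
    """Assumed policy (not from the paper): keep each KV slot independently
    with probability keep_prob, and always keep the most recent slot so the
    softmax below never runs over an empty set."""
    mask = [rng.random() < keep_prob for _ in range(n_slots)]
    mask[-1] = True
    return mask


def attention_with_kv_mask(q, keys, values, keep_mask):
    """Single-query scaled dot-product attention restricted to kept slots.

    q: list of d floats; keys: n lists of d floats;
    values: n lists of d_v floats; keep_mask: n booleans.
    Returns (output vector, attention weights)."""
    scale = 1.0 / math.sqrt(len(q))
    # Masked slots get a score of -inf, hence exactly zero weight after softmax.
    scores = [
        scale * sum(qi * ki for qi, ki in zip(q, k)) if keep else -math.inf
        for k, keep in zip(keys, keep_mask)
    ]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    d_v = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]
    return out, weights


# Toy usage: 8 cached slots, head dimension 4, roughly half the slots masked.
rng = random.Random(0)
n, d = 8, 4
q = [rng.gauss(0, 1) for _ in range(d)]
keys = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
values = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
keep = sample_keep_mask(rng, n, keep_prob=0.5)
out, weights = attention_with_kv_mask(q, keys, values, keep)
```

In this sketch the output depends only on the kept slots, which is the property a model trained under such masking would have to cope with: whatever information it needs must remain recoverable from a reduced set of KV entries. How KV-CAT actually selects slots, and how the mask interacts with layers and heads, is not stated in the abstract.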