🤖 AI Summary
To address the explosive memory consumption and high computational overhead of KV caches in long-context reasoning with large language models (LLMs), this paper proposes a task-agnostic, learnable KV cache distillation framework. Methodologically, it adopts a student–teacher distillation paradigm, aligning output distributions via a KL divergence and enabling parameter-efficient fine-tuning through lightweight adapters, without altering the original model architecture. Notably, it is the first approach to support near-lossless compression of arbitrary spans of the KV cache while preserving pretrained capabilities. Experiments show that the method significantly outperforms existing compression techniques on extractive tasks, achieves performance on par with the full model for long-context question answering and summarization, and maintains downstream accuracy even when the KV cache length is reduced by 99% after domain-specific fine-tuning. Moreover, it generalizes across diverse model scales and architectures.
📝 Abstract
Sequence-to-sequence tasks often benefit from long contexts, but the quadratic complexity of self-attention in standard Transformers renders this non-trivial. During generation, temporary representations, stored in the so-called KV cache, account for a large portion of GPU memory usage and scale linearly with context length. We introduce KV-Distill, a Transformer compression framework that distills long-context KV caches into significantly shorter representations in a question-independent fashion. KV-Distill can be trained as a parameter-efficient adapter for pretrained models, and enables the compression of arbitrary spans of a context while preserving pretrained model capabilities. We treat a compressed–uncompressed cache as a student–teacher pairing and apply a KL-type divergence to match the generated outputs. KV-Distill outperforms other compression techniques in worst-case extractive tasks and approaches uncompressed performance in long-context question answering and summarization, and it can be fine-tuned on domain-specific contexts to reduce cache lengths by up to 99% while preserving downstream performance. We demonstrate the generalizability of KV-Distill across various model sizes and architectures.
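The distillation objective described above — treating the model run on the uncompressed cache as the teacher and the model run on the compressed cache as the student, matched with a KL-type divergence over output distributions — can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a forward KL over next-token distributions (the paper says only "a KL-type divergence"), uses NumPy in place of a deep-learning framework, and the function names are my own.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Convert logits to a probability distribution (numerically stable)."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kv_distill_loss(teacher_logits, student_logits, temperature=1.0, eps=1e-12):
    """Forward KL(teacher || student), averaged over sequence positions.

    teacher_logits: next-token logits from the model with the full KV cache.
    student_logits: next-token logits from the same model with the
                    compressed KV cache (the "student" in the pairing).
    Both arrays have shape (num_positions, vocab_size).
    """
    p = softmax(teacher_logits, temperature)  # teacher distribution
    q = softmax(student_logits, temperature)  # student distribution
    kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return kl.mean()

# Toy check: identical logits give (near-)zero divergence,
# i.e. a lossless compression incurs no distillation loss.
rng = np.random.default_rng(0)
t = rng.normal(size=(4, 32))  # 4 positions, vocabulary of 32
assert kv_distill_loss(t, t) < 1e-6
```

In a real training loop the student logits would be produced by the frozen pretrained model augmented with the lightweight adapter that compresses the cache, and only the adapter parameters would receive gradients, which is what makes the approach parameter-efficient.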