🤖 AI Summary
To address the explosive memory consumption and high computational overhead of KV caches in long-context reasoning with large language models (LLMs), this paper proposes a task-agnostic, learnable KV cache distillation framework. Methodologically, it adopts a student–teacher distillation paradigm, aligning output distributions via a KL divergence and enabling parameter-efficient fine-tuning through lightweight adapters, without altering the original model architecture. Notably, it is the first approach to support near-lossless compression of arbitrary spans of the KV cache while preserving pretrained capabilities. Experiments show that the method significantly outperforms existing compression techniques on extractive tasks, achieves performance on par with the full model for long-context question answering and summarization, and maintains downstream accuracy even when the KV cache length is reduced by 99% after domain-specific fine-tuning. Moreover, it generalizes across diverse model scales and architectures.
📝 Abstract
Sequence-to-sequence tasks often benefit from long contexts, but the quadratic complexity of self-attention in standard Transformers renders this non-trivial. During generation, temporary representations, stored in the so-called KV cache, account for a large portion of GPU memory usage and scale linearly with context length. We introduce KV-Distill, a Transformer compression framework that distills long-context KV caches into significantly shorter representations in a question-independent fashion. KV-Distill can be trained as a parameter-efficient adapter for pretrained models, and enables the compression of arbitrary spans of a context while preserving pretrained model capabilities. We treat a compressed–uncompressed cache as a student–teacher pairing and apply a KL-type divergence to match the generated outputs. KV-Distill outperforms other compression techniques in worst-case extractive tasks and approaches uncompressed performance in long-context question answering and summarization, and it can be fine-tuned on domain-specific contexts to reduce cache lengths by up to 99% while preserving downstream performance. We demonstrate the generalizability of KV-Distill across various model sizes and architectures.
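The distillation objective described above — treating the model run on the uncompressed cache as the teacher and the model run on the compressed cache as the student, matched with a KL-type divergence over output distributions — can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a forward KL over next-token distributions (the paper says only "a KL-type divergence"), uses NumPy in place of a deep-learning framework, and the function names are my own.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Convert logits to a probability distribution (numerically stable)."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kv_distill_loss(teacher_logits, student_logits, temperature=1.0, eps=1e-12):
    """Forward KL(teacher || student), averaged over sequence positions.

    teacher_logits: next-token logits from the model with the full KV cache.
    student_logits: next-token logits from the same model with the
                    compressed KV cache (the "student" in the pairing).
    Both arrays have shape (num_positions, vocab_size).
    """
    p = softmax(teacher_logits, temperature)  # teacher distribution
    q = softmax(student_logits, temperature)  # student distribution
    kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return kl.mean()

# Toy check: identical logits give (near-)zero divergence,
# i.e. a lossless compression incurs no distillation loss.
rng = np.random.default_rng(0)
t = rng.normal(size=(4, 32))  # 4 positions, vocabulary of 32
assert kv_distill_loss(t, t) < 1e-6
```

In a real training loop the student logits would be produced by the frozen pretrained model augmented with the lightweight adapter that compresses the cache, and only the adapter parameters would receive gradients, which is what makes the approach parameter-efficient.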