Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility

📅 2026-05-13

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

This work addresses the substantial memory and bandwidth overhead of key-value (KV) caching in large language model inference, which becomes a critical bottleneck for long-sequence processing. The authors propose Self-Pruned Key-Value Attention (SP-KV), a lightweight, end-to-end trainable mechanism in which a utility predictor scores the future utility of each KV pair. Only pairs whose predicted utility exceeds a threshold are written to the global cache, while recent context remains available through a local window, enabling fine-grained dynamic compression without a predefined compression ratio. Pruning by predicted future utility also reveals layer- and head-specific sparsity patterns. Experiments show that the approach typically achieves 3–10× dynamic compression of the KV cache with little to no degradation in validation loss or downstream task performance, improving memory efficiency and decoding speed for long sequences.

📝 Abstract

Under modern test-time compute and agentic paradigms, language models process ever-longer sequences. Efficient text generation with transformer architectures is increasingly constrained by the memory footprint and bandwidth of the Key-Value (KV) cache. To address this limitation, we introduce Self-Pruned Key-Value Attention (SP-KV), a mechanism that predicts future KV utility in order to reduce the size of the long-term KV cache. This strategy operates at a fine granularity: a lightweight utility predictor scores each key-value pair. Recent KVs are always available via a local window, whereas older pairs are written to the cache and used in global attention only if their predicted utility surpasses a given threshold. The LLM and the utility predictor are adapted from pretrained LLM checkpoints and trained jointly end-to-end, exclusively through the next-token prediction loss. Rather than enforcing a fixed compression ratio, SP-KV performs dynamic sparsification: the mechanism adapts to the input and typically reduces the KV cache size by a factor of $3\times$ to $10\times$, with longer sequences often being more compressible. This leads to large improvements in memory usage and decoding speed, with little to no degradation in validation loss or in performance on a broad set of downstream tasks. Beyond serving as an effective KV-cache reduction mechanism, our method reveals structured layer- and head-specific sparsity patterns that can guide the design of hybrid local-global attention architectures.
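
The visibility rule described in the abstract (local window always available, older pairs kept only above a utility threshold) can be illustrated with a small sketch. This is not the authors' implementation: the function names `visible_mask` and `cache_size`, the hand-picked utility scores, the window size, and the threshold are all assumptions made for illustration, and the learned utility predictor and its end-to-end training are not shown.

```python
from typing import List


def visible_mask(utility: List[float], window: int, threshold: float) -> List[List[bool]]:
    """Return mask[i][j], True when query position i may attend to key position j.

    Hypothetical helper: `utility[j]` stands in for the score a learned
    predictor would assign to the KV pair at position j.
    """
    n = len(utility)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):  # causal attention: only positions j <= i
            in_local_window = (i - j) < window       # recent KVs are always visible
            kept_globally = utility[j] > threshold   # older KVs must pass the threshold
            mask[i][j] = in_local_window or kept_globally
    return mask


def cache_size(utility: List[float], window: int, threshold: float, step: int) -> int:
    """Count the KV pairs held while decoding position `step` (0-indexed)."""
    local = min(window, step + 1)
    older = utility[: max(0, step + 1 - window)]
    return local + sum(1 for u in older if u > threshold)


# Illustrative scores only; a real predictor would produce these per layer and head.
utility = [0.9, 0.1, 0.2, 0.8, 0.05, 0.3, 0.7, 0.4]
mask = visible_mask(utility, window=2, threshold=0.5)
kept = cache_size(utility, window=2, threshold=0.5, step=7)
ratio = len(utility) / kept  # dense cache size divided by pruned cache size
print(kept, ratio)
```

In this toy case the last position sees its two most recent neighbors plus the two older positions whose score exceeds the threshold. Because the kept count depends on the scores rather than on a preset budget, the resulting compression ratio varies with the input, which is the sense in which the sparsification is dynamic.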

Problem

Research questions and friction points this paper is trying to address.

Key-Value cache

memory efficiency

long-sequence generation

transformer architectures

attention mechanism

Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-Pruned KV Attention

KV Cache Compression

Dynamic Sparsification