🤖 AI Summary
This study addresses the KV cache memory bottleneck that large language models face in long-context settings (>100K tokens), where existing compression methods overlook the roles of semantic accessibility and routing structure within attention mechanisms. The authors frame KV compression as a controlled perturbation of token-level attention routing and introduce a tripartite routing perspective: retention, accessibility, and utilization. Through synthetic tasks, a novel Global Eviction Ratio (GER) metric, and cross-architecture evaluations (LLaMA, Qwen), they show that moderate compression preserves task accuracy yet degrades representational quality, and that hallucinations surge and GER shifts abruptly near 90% compression. The work further demonstrates model-specific routing resilience and shows that sparse token-routing structures govern compression tolerance, linking long-context scalability to the lottery ticket hypothesis in self-attention.
📝 Abstract
As context windows in LLMs scale to 100K+ tokens, the key-value (KV) cache becomes the dominant memory bottleneck, with recent methods claiming 80-90% savings and minimal benchmark degradation. We argue these evaluations miss a structural issue: attention is not just storage but routing, and retaining KV pairs does not guarantee semantic accessibility. We propose a physics-inspired view of KV compression as a controlled perturbation of token-level routing, distinguishing retention, accessibility, and utilization. Using synthetic tasks probing multi-entity tracking, disambiguation, coreference, and multi-hop reasoning, we find that moderate compression degrades internal representations with little accuracy loss, revealing redundancy; that all models exhibit a sharp hallucination safety cliff near 90% compression, correlated with spikes in the Global Eviction Ratio (GER), suggesting a phase transition in semantic reachability; and that architectures differ in routing dynamics, with LLaMA showing early consensus and late diversification, and Qwen showing funnel-like late convergence, leading to distinct resilience profiles. Beyond erasure, we identify representational rigidity, where excessive head-level consensus collapses routing flexibility despite token survival. These results suggest sparse token-route structures govern compression tolerance, reframing KV compression as a structural probe of attention geometry and linking long-context scalability to sparsity and the lottery ticket hypothesis in self-attention.
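To make the distinction between per-head eviction and global unreachability concrete, here is a minimal NumPy sketch of score-based KV eviction together with a GER-style metric. This is a hypothetical reading, not the paper's implementation: the abstract does not define GER, so we assume attention-mass-based per-head eviction (as in heavy-hitter-style methods) and read "global eviction" as a token being dropped by every head, i.e. becoming semantically unreachable rather than merely de-prioritized.

```python
import numpy as np

def evict_kv(attn_scores: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Per-head top-k KV eviction (hypothetical sketch).

    attn_scores: (num_heads, seq_len) array of cumulative attention
    mass each cached token has received -- a common eviction signal.
    Returns a boolean keep-mask of the same shape: True = token kept
    in that head's KV cache, False = evicted.
    """
    num_heads, seq_len = attn_scores.shape
    k = max(1, int(seq_len * keep_ratio))
    keep = np.zeros_like(attn_scores, dtype=bool)
    # Indices of the k highest-scoring tokens per head.
    top = np.argsort(attn_scores, axis=1)[:, -k:]
    np.put_along_axis(keep, top, True, axis=1)
    return keep

def global_eviction_ratio(keep_mask: np.ndarray) -> float:
    """Fraction of tokens evicted by *all* heads.

    A token kept by even one head remains reachable through routing;
    a token dropped everywhere is globally erased. (Assumed reading
    of the paper's GER; the exact definition is not in the abstract.)
    """
    globally_evicted = ~keep_mask.any(axis=0)
    return float(globally_evicted.mean())
```

Under this reading, the abstract's observation makes structural sense: at moderate compression, heads keep overlapping-but-different token subsets, so the union stays large and GER stays low even as each head's cache shrinks; near 90% compression the per-head survivor sets become too small to cover the sequence, and GER jumps.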