🤖 AI Summary
In autoregressive Transformer inference, the KV cache grows linearly with context length, incurring substantial memory overhead and bandwidth pressure that hinder low-latency, memory-efficient deployment in real-time dialogue systems. Existing KV compression or truncation methods often degrade accuracy by discarding critical long-range context or introducing bias. To address this, the paper proposes MorphKV, an inference-time technique that maintains a *constant-size* KV cache while preserving long-range fidelity. MorphKV combines three ideas: (1) dynamic cache pruning guided by the attention patterns of recent tokens, (2) correlation-aware adaptive token ranking and selection, and (3) lightweight iterative cache refinement, which together eliminate early-token bias while retaining essential context. On long-response generation tasks, MorphKV reduces memory footprint by 52.9% and improves average accuracy by 18.2% over state-of-the-art baselines, enabling practical deployment in latency- and memory-constrained settings.
📝 Abstract
Autoregressive Transformers rely on Key-Value (KV) caching to accelerate inference. However, the linear growth of the KV cache with context length leads to excessive memory consumption and bandwidth constraints. This bottleneck is particularly problematic in real-time applications -- such as chatbots and interactive assistants -- where low latency and high memory efficiency are critical. Existing methods drop distant tokens or compress states in a lossy manner, sacrificing accuracy by discarding vital context or introducing bias. We propose MorphKV, an inference-time technique that maintains a constant-sized KV cache while preserving accuracy. MorphKV balances long-range dependencies and local coherence during text generation. It eliminates early-token bias while retaining high-fidelity context by adaptively ranking tokens through correlation-aware selection. Unlike heuristic retention or lossy compression, MorphKV iteratively refines the KV cache via lightweight updates guided by attention patterns of recent tokens. This approach captures inter-token correlation with greater accuracy, crucial for tasks like content creation and code generation. Our studies on long-response tasks show 52.9% memory savings and 18.2% higher accuracy on average compared to state-of-the-art prior works, enabling efficient real-world deployment.
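The abstract does not include code, so the sketch below is only a rough illustration of the general idea of constant-size, attention-guided KV-cache pruning: older cache entries are scored by the attention they receive from the most recent tokens, and only a fixed budget is retained alongside the recent window. The function name `prune_kv_cache` and the parameters `window` and `budget` are hypothetical and not the authors' implementation or API.

```python
import numpy as np

def prune_kv_cache(keys, values, recent_attn, window=8, budget=64):
    """Illustrative sketch (not the paper's implementation).

    Keeps a constant-size KV cache: the last `window` tokens plus the
    `budget - window` older tokens that received the most attention from
    the recent window.

    keys, values : arrays of shape (seq_len, d)
    recent_attn  : attention weights of the last `window` query positions
                   over all cached positions, shape (window, seq_len)
    """
    seq_len = keys.shape[0]
    if seq_len <= budget:
        return keys, values, np.arange(seq_len)

    older = np.arange(seq_len - window)  # candidate positions for pruning
    # Relevance of each older token = total attention it receives from recent queries
    scores = recent_attn[:, : seq_len - window].sum(axis=0)
    keep_older = older[np.argsort(scores)[-(budget - window):]]
    recent = np.arange(seq_len - window, seq_len)
    keep = np.sort(np.concatenate([keep_older, recent]))
    return keys[keep], values[keep], keep

# Toy usage: the pruned cache stays at the fixed budget regardless of sequence length.
seq_len, d, window, budget = 200, 64, 8, 64
keys = np.random.randn(seq_len, d)
values = np.random.randn(seq_len, d)
recent_attn = np.random.rand(window, seq_len)
k, v, kept = prune_kv_cache(keys, values, recent_attn, window, budget)
assert k.shape[0] == budget
```

In this toy version the retained set is recomputed from scratch at each step; the paper describes lightweight iterative refinement of the cache guided by recent-token attention, which this sketch does not attempt to reproduce.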