🤖 AI Summary
In autoregressive Transformer inference, the KV cache grows linearly with context length, incurring substantial memory overhead and bandwidth pressure that hinder low-latency, memory-efficient deployment in real-time dialogue systems. Existing KV compression or truncation methods often degrade accuracy by discarding critical long-range context or introducing bias. To address this, the paper proposes MorphKV, an inference-time technique that maintains a *constant-size* KV cache while preserving long-range fidelity. MorphKV combines three ideas: (1) dynamic cache pruning guided by the attention patterns of recent tokens, (2) correlation-aware adaptive token ranking and selection, and (3) lightweight iterative cache refinement, which together eliminate early-token bias while retaining essential context. On long-response generation tasks, MorphKV reduces memory footprint by 52.9% and improves average accuracy by 18.2% over state-of-the-art baselines, enabling practical deployment in latency- and memory-constrained settings.
📝 Abstract
Autoregressive Transformers rely on Key-Value (KV) caching to accelerate inference. However, the linear growth of the KV cache with context length leads to excessive memory consumption and bandwidth constraints. This bottleneck is particularly problematic in real-time applications -- such as chatbots and interactive assistants -- where low latency and high memory efficiency are critical. Existing methods drop distant tokens or compress states in a lossy manner, sacrificing accuracy by discarding vital context or introducing bias. We propose MorphKV, an inference-time technique that maintains a constant-sized KV cache while preserving accuracy. MorphKV balances long-range dependencies and local coherence during text generation. It eliminates early-token bias while retaining high-fidelity context by adaptively ranking tokens through correlation-aware selection. Unlike heuristic retention or lossy compression, MorphKV iteratively refines the KV cache via lightweight updates guided by attention patterns of recent tokens. This approach captures inter-token correlation with greater accuracy, crucial for tasks like content creation and code generation. Our studies on long-response tasks show 52.9% memory savings and 18.2% higher accuracy on average compared to state-of-the-art prior works, enabling efficient real-world deployment.
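The abstract does not include code, so the sketch below is only a rough illustration of the general idea of constant-size, attention-guided KV-cache pruning: older cache entries are scored by the attention they receive from the most recent tokens, and only a fixed budget is retained alongside the recent window. The function name `prune_kv_cache` and the parameters `window` and `budget` are hypothetical and not the authors' implementation or API.

```python
import numpy as np

def prune_kv_cache(keys, values, recent_attn, window=8, budget=64):
    """Illustrative sketch (not the paper's implementation).

    Keeps a constant-size KV cache: the last `window` tokens plus the
    `budget - window` older tokens that received the most attention from
    the recent window.

    keys, values : arrays of shape (seq_len, d)
    recent_attn  : attention weights of the last `window` query positions
                   over all cached positions, shape (window, seq_len)
    """
    seq_len = keys.shape[0]
    if seq_len <= budget:
        return keys, values, np.arange(seq_len)

    older = np.arange(seq_len - window)  # candidate positions for pruning
    # Relevance of each older token = total attention it receives from recent queries
    scores = recent_attn[:, : seq_len - window].sum(axis=0)
    keep_older = older[np.argsort(scores)[-(budget - window):]]
    recent = np.arange(seq_len - window, seq_len)
    keep = np.sort(np.concatenate([keep_older, recent]))
    return keys[keep], values[keep], keep

# Toy usage: the pruned cache stays at the fixed budget regardless of sequence length.
seq_len, d, window, budget = 200, 64, 8, 64
keys = np.random.randn(seq_len, d)
values = np.random.randn(seq_len, d)
recent_attn = np.random.rand(window, seq_len)
k, v, kept = prune_kv_cache(keys, values, recent_attn, window, budget)
assert k.shape[0] == budget
```

In this toy version the retained set is recomputed from scratch at each step; the paper describes lightweight iterative refinement of the cache guided by recent-token attention, which this sketch does not attempt to reproduce.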