FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation

📅 2025-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the memory overhead and latency that KV caching imposes on long-context inference in large language models, this paper proposes FastKV, a hierarchical KV cache compression framework that balances inference speed and accuracy. It introduces two key ideas: (1) Token-Selective Propagation (TSP), which retains full KV states in the shallow layers but propagates only the most salient tokens to deeper layers, even during the prefill stage; and (2) a grouped-query attention (GQA)-aware KV cache compression strategy combined with hierarchical sparsification, which exploits GQA's head grouping to improve both memory use and compute. On standard long-context benchmarks, FastKV reduces time-to-first-token (TTFT) by 50% (a 2.00x speedup) and raises throughput by 40% (1.40x) relative to the state-of-the-art HeadKV, while preserving baseline-level accuracy.
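The TSP step can be illustrated with a short sketch. The snippet below is a minimal, hypothetical PyTorch example of scoring prompt tokens by the attention mass they receive at one layer and keeping only the top fraction for propagation to deeper layers during prefill; the scoring rule, keep ratio, and tensor shapes are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def select_propagated_tokens(attn_weights: torch.Tensor, keep_ratio: float = 0.25) -> torch.Tensor:
    """Pick indices of the most-attended prompt tokens at the TSP layer.

    attn_weights: [batch, heads, q_len, kv_len] attention probabilities from the
    layer where token-selective propagation is applied (shape and scoring rule
    are assumptions for illustration).
    """
    # Total attention mass each key/value token receives, pooled over heads and queries.
    token_scores = attn_weights.sum(dim=(1, 2))               # [batch, kv_len]
    num_keep = max(1, int(attn_weights.shape[-1] * keep_ratio))
    top_idx = token_scores.topk(num_keep, dim=-1).indices     # [batch, num_keep]
    return top_idx.sort(dim=-1).values                        # restore original token order

# Toy usage: 1 sequence, 8 heads, 128 prompt tokens -> keep 32 tokens for deeper layers.
attn = torch.softmax(torch.randn(1, 8, 128, 128), dim=-1)
print(select_propagated_tokens(attn).shape)  # torch.Size([1, 32])
```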

📝 Abstract
While large language models (LLMs) excel at handling long-context sequences, they require substantial key-value (KV) caches to store contextual information, which can heavily burden computational efficiency and memory usage. Previous efforts to compress these KV caches primarily focused on reducing memory demands but were limited in enhancing latency. To address this issue, we introduce FastKV, a KV cache compression method designed to enhance latency for long-context sequences. To enhance processing speeds while maintaining accuracy, FastKV adopts a novel Token-Selective Propagation (TSP) approach that retains the full context information in the initial layers of LLMs and selectively propagates only a portion of this information in deeper layers even in the prefill stage. Additionally, FastKV incorporates grouped-query attention (GQA)-aware KV cache compression to exploit the advantages of GQA in both memory and computational efficiency. Our experimental results show that FastKV achieves 2.00× and 1.40× improvements in time-to-first-token (TTFT) and throughput, respectively, compared to HeadKV, the state-of-the-art KV cache compression method. Moreover, FastKV successfully maintains accuracy on long-context benchmarks at levels comparable to the baselines. Our code is available at https://github.com/dongwonjo/FastKV.
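To make the GQA-aware compression idea concrete, here is a minimal, hypothetical PyTorch sketch: attention scores are pooled over the query heads that share each KV head, and only a per-KV-head budget of cached tokens is kept, so pruning happens at the granularity at which the cache is actually stored. The aggregation rule, budget, and shapes are illustrative assumptions, not FastKV's actual code.

```python
import torch

def gqa_aware_scores(attn_weights: torch.Tensor, num_kv_heads: int) -> torch.Tensor:
    """Aggregate per-query-head attention into per-KV-head token scores.

    attn_weights: [batch, num_q_heads, q_len, kv_len]; each group of
    num_q_heads // num_kv_heads query heads shares one KV head, so their
    scores are pooled before deciding which cached tokens to keep.
    (Illustrative aggregation; the paper's scoring may differ.)
    """
    b, num_q_heads, q_len, kv_len = attn_weights.shape
    group = num_q_heads // num_kv_heads
    grouped = attn_weights.view(b, num_kv_heads, group, q_len, kv_len)
    return grouped.sum(dim=(2, 3))                            # [batch, num_kv_heads, kv_len]

def compress_kv(keys, values, scores, budget):
    """Keep only the top-`budget` cached tokens per KV head.

    keys/values: [batch, num_kv_heads, kv_len, head_dim]
    scores:      [batch, num_kv_heads, kv_len]
    """
    idx = scores.topk(budget, dim=-1).indices.sort(dim=-1).values  # [batch, kv_heads, budget]
    gather_idx = idx.unsqueeze(-1).expand(-1, -1, -1, keys.shape[-1])
    return keys.gather(2, gather_idx), values.gather(2, gather_idx)

# Toy usage: 32 query heads sharing 8 KV heads, 128 cached tokens, keep 32 per KV head.
attn = torch.softmax(torch.randn(1, 32, 128, 128), dim=-1)
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)
k_c, v_c = compress_kv(k, v, gqa_aware_scores(attn, num_kv_heads=8), budget=32)
print(k_c.shape)  # torch.Size([1, 8, 32, 64])
```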
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Memory Consumption
Long Text Processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

FastKV
Memory Optimization
GQA Technique