🤖 AI Summary
This work addresses the performance bottleneck in the Re-Prefill phase of large language model serving during multi-turn conversations, which arises from I/O read amplification and serial compute–I/O dependencies caused by KV cache offloading. To mitigate these issues, the authors propose a unified-granularity ContiguousChunk data layout that eliminates read amplification, an asynchronous prefetching mechanism that leverages the cross-layer similarity of important chunk indices, and an attention-aware, fine-grained cache retention strategy. Evaluated on the Qwen2.5 model series, the approach achieves a 3.85× speedup in the Re-Prefill phase over IMPRESS while preserving high output quality, effectively alleviating the efficiency–quality trade-off inherent in long-context inference.
📝 Abstract
Efficiently serving Large Language Models (LLMs) with a persistent prefix Key-Value (KV) cache is critical for applications like conversational search and multi-turn dialogue. Serving a request requires loading the pre-computed prefix KV cache and generating the first token, a stage we define as the Re-Prefill phase. Offloading this shared prefix cache to secondary storage is essential for memory scalability, but Re-Prefill with offloading suffers from severe I/O bottlenecks in two respects. First, semantic-aware KV cache pruning algorithms select important tokens at fine granularity, while systems manage I/O in coarse, fixed-size blocks, causing heavy read amplification. Second, the sequential dependency between identifying important tokens and loading their KV cache creates idle I/O and compute bubbles, under-utilizing system resources. This paper proposes \textit{ContiguousKV}, a high-performance prefix KV cache offloading system that bridges algorithmic semantics with I/O efficiency to accelerate the Re-Prefill phase. We first introduce \textit{ContiguousChunk}, a unified data management granularity that aligns KV cache pruning with I/O operations: all I/O-critical mechanisms operate at ContiguousChunk granularity, thereby eliminating read amplification. By exploiting the high similarity of important ContiguousChunk indices across layers, we propose intra- and inter-period asynchronous prefetching that breaks the sequential dependency between I/O and compute, effectively eliminating idle bubbles. Finally, we propose attention-guided cache management to retain semantically critical prefix data in memory. Evaluations on the Qwen2.5 model series show that ContiguousKV achieves a 3.85× speedup in the Re-Prefill phase over the state-of-the-art offloading system IMPRESS, while maintaining high output quality.
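To make the read-amplification argument concrete, here is a minimal sketch of a unified pruning/I/O granularity. The chunk size, byte layout, and function names (`select_chunks`, `read_chunks`) are illustrative assumptions, not the paper's actual implementation; the point is that because importance is scored at the same granularity the storage layout uses, every selected chunk maps to one contiguous, fully useful read.

```python
import numpy as np

CHUNK_TOKENS = 32                  # assumed chunk size; the paper's value may differ
BYTES_PER_TOKEN_KV = 2 * 128 * 2   # K+V, head_dim=128, fp16 -- illustrative only

def select_chunks(token_scores: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Aggregate per-token importance to chunk granularity, then pick top chunks.

    Selection and storage share one granularity, so every pruning decision
    corresponds exactly to one contiguous on-disk region.
    """
    n_chunks = len(token_scores) // CHUNK_TOKENS
    chunk_scores = (token_scores[: n_chunks * CHUNK_TOKENS]
                    .reshape(n_chunks, CHUNK_TOKENS)
                    .sum(axis=1))
    k = max(1, int(n_chunks * keep_ratio))
    return np.sort(np.argpartition(chunk_scores, -k)[-k:])

def read_chunks(f, chunk_ids: np.ndarray) -> bytes:
    """One seek+read per selected chunk; no unselected bytes are transferred,
    i.e., zero read amplification."""
    chunk_bytes = CHUNK_TOKENS * BYTES_PER_TOKEN_KV
    out = bytearray()
    for cid in chunk_ids:
        f.seek(int(cid) * chunk_bytes)
        out += f.read(chunk_bytes)
    return bytes(out)
```

By contrast, a fixed-size block manager that is unaware of the pruner's token-level choices would have to fetch every block containing at least one selected token, transferring many bytes it never uses.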
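The asynchronous prefetching idea can likewise be sketched generically. This is not the paper's intra-/inter-period scheme; it is a simplified single-worker pipeline, assuming hypothetical `load_chunks` and `compute_layer` callables, that shows how cross-layer index similarity lets the loader run ahead of the compute: while layer l is being computed, layer l+1's chunks are fetched using the most recently observed important indices as a prediction.

```python
from concurrent.futures import ThreadPoolExecutor

def reprefill_with_prefetch(layers, load_chunks, compute_layer, initial_ids):
    """Overlap I/O with compute during Re-Prefill.

    Because important-chunk indices are highly similar across layers, the
    indices observed at an earlier layer are a good prediction of what the
    next layer needs, so its load can start before its turn to compute.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    pending = pool.submit(load_chunks, 0, initial_ids)  # prefetch layer 0
    predicted_ids = initial_ids
    for l in range(len(layers)):
        kv = pending.result()                 # wait for this layer's chunks
        if l + 1 < len(layers):
            # kick off the next layer's load BEFORE computing this layer,
            # using stale (predicted) indices -- this is where the bubble closes
            pending = pool.submit(load_chunks, l + 1, predicted_ids)
        # compute_layer returns this layer's true important chunk indices
        predicted_ids = compute_layer(layers[l], kv)
        # (a real system would patch in any mispredicted chunks on demand)
    pool.shutdown()
```

Without the prefetch, each layer's load could only begin after the previous layer's compute identified the needed tokens, leaving the I/O path idle during compute and vice versa.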
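Finally, a minimal sketch of attention-guided retention, under the assumption that eviction is driven by accumulated attention mass rather than recency (the class name, keying scheme, and eviction policy below are all hypothetical, not ContiguousKV's actual design):

```python
class AttentionGuidedCache:
    """In-memory chunk cache that evicts the chunk with the lowest accumulated
    attention score instead of the least-recently-used one, so semantically
    critical prefix data stays resident across turns."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.store = {}    # (layer, chunk_id) -> KV bytes
        self.scores = {}   # (layer, chunk_id) -> accumulated attention mass

    def put(self, key, kv, score: float):
        # accumulate the attention mass this chunk received in the latest pass
        self.scores[key] = self.scores.get(key, 0.0) + score
        if key not in self.store and len(self.store) >= self.capacity:
            # evict the resident chunk with the least accumulated attention
            victim = min(self.store, key=lambda k: self.scores.get(k, 0.0))
            del self.store[victim]
        self.store[key] = kv

    def get(self, key):
        return self.store.get(key)  # None on miss -> caller loads from storage
```

The design choice this illustrates is replacing a recency signal with the pruner's own importance signal, so the cache and the pruning algorithm agree on which prefix data is worth keeping in memory.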