CacheFlow: Efficient LLM Serving with 3D-Parallel KV Cache Restoration

📅 2026-04-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the performance bottleneck that KV cache restoration creates when serving large language models on long-context workloads such as multi-turn dialogue and retrieval-augmented generation. The paper introduces a three-dimensional parallel framework designed specifically for KV cache restoration. It models restoration as a parallel task across tokens, layers, and GPUs, combining a Transformer-structure-aware 3D parallel strategy with a batch-aware two-pointer scheduling algorithm. This allows coordinated scheduling across requests, layers, and devices while overlapping computation with I/O. The reported experiments show that, across diverse models, workloads, and hardware configurations, the method reduces Time-To-First-Token (TTFT) by 10% to 62% compared to existing approaches.
📝 Abstract
KV cache restoration has emerged as a dominant bottleneck in serving long-context LLM workloads, including multi-turn conversations, retrieval-augmented generation, and agentic pipelines. Existing approaches treat restoration as a per-request tradeoff between recomputation and I/O transfer, either recomputing KV states from scratch or loading them from external storage (e.g., CPU memory or remote machines). However, these approaches fail to exploit parallelism across tokens, layers, and distributed deployments, and critically ignore resource contention under batched serving. We present CacheFlow, a KV cache restoration framework that rethinks cache restoration as a multi-dimensional parallel execution problem. CacheFlow introduces a unified 3D parallelism abstraction across tokens, layers, and GPUs, enabling fine-grained overlap of recomputation and I/O along the structural dependencies of transformer inference. At the core of CacheFlow is a batch-aware two-pointer scheduler that jointly optimizes compute and I/O allocation across requests by prioritizing operations with the highest marginal reduction in recomputation cost. Our evaluations show that CacheFlow reduces Time-To-First-Token (TTFT) by 10%-62% over existing approaches across diverse models, workloads, and hardware.
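To make the idea of overlapping recomputation with I/O concrete, the following is a minimal single-request sketch, not CacheFlow's actual scheduler. It assumes the prompt is split into equal token chunks, that each chunk has a constant recomputation cost and a constant load cost (real recomputation cost grows with token position), and that one pointer recomputes from the front while another loads from the back until they meet. The function name `two_pointer_split` and its parameters are hypothetical.

```python
def two_pointer_split(num_chunks, recompute_cost, load_cost):
    """Toy two-pointer restoration for one request (illustrative assumption).

    The compute pointer recomputes chunks from the front of the sequence,
    the I/O pointer loads chunks from the back, and both run in parallel.
    Returns (chunks_recomputed, chunks_loaded, finish_time).
    """
    front, back = 0, num_chunks - 1
    compute_t = 0.0  # time at which the compute resource becomes free
    io_t = 0.0       # time at which the I/O resource becomes free

    while front <= back:
        # Give the next chunk to whichever resource would finish it sooner.
        if compute_t + recompute_cost <= io_t + load_cost:
            compute_t += recompute_cost
            front += 1
        else:
            io_t += load_cost
            back -= 1

    recomputed = front
    loaded = num_chunks - 1 - back
    # Restoration completes when the slower of the two resources finishes.
    return recomputed, loaded, max(compute_t, io_t)


if __name__ == "__main__":
    # 8 chunks, recomputing a chunk takes 1 unit, loading one takes 3 units.
    print(two_pointer_split(8, 1.0, 3.0))
```

Under these toy costs, recomputing everything would take 8 time units and loading everything would take 24, while the overlapped split finishes in 6. A batch-aware scheduler as described in the abstract would additionally have to arbitrate shared compute and I/O bandwidth across requests, which this sketch does not model.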
Problem

Research questions and friction points this paper is trying to address.

KV cache restoration
long-context LLM serving
resource contention
batched inference
I/O bottleneck
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D-parallel KV cache restoration
KV cache management
LLM serving optimization
batch-aware scheduling
compute-I/O overlap