Adaptive KV Cache Reuse for Fast Long-Context LLM Serving

📅 2026-05-20
🤖 AI Summary
This work addresses high time-to-first-token (TTFT) latency in long-context large language model inference. Two limitations drive that latency: existing KV cache reuse is restricted to strict prefix matches, and migrating caches across heterogeneous hardware incurs substantial I/O overhead. To overcome this, the authors propose CacheTune, the first system enabling efficient, semantically consistent KV cache reuse beyond exact prefixes. CacheTune combines offline frequency-domain analysis to identify critical KV pairs with online selective recomputation of semantically important tokens, augmented by sparse KV transfer, multi-stream asynchronous I/O overlapping, delayed positional encoding recovery, and a hardware-aware adaptive recomputation strategy. Evaluated on mainstream models and long-context tasks, CacheTune achieves 3.72–4.86× TTFT speedup and 3.93–6.21× throughput improvement over full recomputation while preserving near-identical generation quality. Even when caches are offloaded to SSDs or HDDs, it maintains a 2.34–2.36× TTFT speedup.
📝 Abstract
In long-context Large Language Model (LLM) inference, the Time-To-First-Token (TTFT) latency incurred by the prefill stage has become the foremost bottleneck limiting interactive performance and deployment cost. KV Cache reuse offers a direct path to reduce redundant prefill, yet traditional prefix caching applies only to strict-prefix scenarios; directly reusing KV Cache in non-prefix settings breaks the cross-chunk global attention relationships and causes significant degradation in generation quality. When reusable KV Cache is offloaded to GPU-external cache pools, I/O overheads across heterogeneous hardware tiers further emerge as a new TTFT bottleneck. Efficient non-prefix KV Cache reuse therefore requires both semantic-consistency recovery and compute-I/O co-optimization. This paper presents CacheTune, a frequency-guided and hardware-aware KV Cache reuse system for long-context LLM serving. CacheTune first identifies, offline, the KV pairs most critical to cross-attention recovery through frequency-domain analysis, and then selectively recomputes only these semantic-critical tokens online while reusing the remaining KVs. To turn this semantic selection into end-to-end latency reduction, CacheTune further combines sparse KV transfer, multi-stream asynchronous overlap, deferred positional-encoding recovery, and hardware-aware adaptive recomputation-ratio tuning to balance computation and data movement across heterogeneous cache pools. Evaluations on mainstream LLMs and long-context tasks show that CacheTune achieves 3.72×–4.86× TTFT speedup and 3.93×–6.21× higher throughput while maintaining generation quality close to full recompute. Even when caches are offloaded to I/O-bound SSD/HDD storage, CacheTune sustains 2.34×–2.36× TTFT speedup through adaptive recomputation.
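To make the compute versus I/O trade-off behind adaptive recomputation-ratio tuning concrete, here is a minimal toy sketch in Python. It is not CacheTune's algorithm, which the abstract does not specify. It assumes a simple cost model in which recomputation and KV loading overlap perfectly, so the estimated TTFT is the slower of the two. The names `TierProfile`, `estimate_ttft`, `pick_ratio`, and `min_ratio`, as well as all numeric values, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class TierProfile:
    """Hypothetical cost profile for one cache tier (illustrative numbers only)."""
    name: str
    bandwidth_gbps: float     # assumed sustained read bandwidth, in GB/s
    prefill_tok_per_s: float  # assumed GPU prefill throughput, in tokens/s


def estimate_ttft(ratio, n_tokens, kv_bytes_per_token, tier):
    """Toy TTFT estimate, assuming recompute and KV loading overlap perfectly."""
    # Time to recompute the selected fraction of tokens on the GPU.
    compute_s = ratio * n_tokens / tier.prefill_tok_per_s
    # Time to load the KV entries of the remaining (reused) tokens.
    reused_bytes = (1.0 - ratio) * n_tokens * kv_bytes_per_token
    io_s = reused_bytes / (tier.bandwidth_gbps * 1e9)
    # With perfect overlap, the slower of the two streams sets the latency.
    return max(compute_s, io_s)


def pick_ratio(n_tokens, kv_bytes_per_token, tier, min_ratio=0.1, steps=100):
    """Grid-search the recomputation ratio in [min_ratio, 1.0].

    min_ratio stands in for a quality floor: the share of tokens that must
    be recomputed regardless of hardware (an assumption of this sketch).
    """
    candidates = [min_ratio + (1.0 - min_ratio) * i / steps for i in range(steps + 1)]
    return min(
        candidates,
        key=lambda r: estimate_ttft(r, n_tokens, kv_bytes_per_token, tier),
    )


if __name__ == "__main__":
    kv_bytes = 128 * 1024  # assumed KV footprint per token, in bytes
    for tier in (TierProfile("fast", 25.0, 10_000.0),
                 TierProfile("slow", 0.2, 10_000.0)):
        r = pick_ratio(32_000, kv_bytes, tier)
        print(tier.name, round(r, 3), round(estimate_ttft(r, 32_000, kv_bytes, tier), 3))
```

Under this model a fast tier keeps the ratio at the quality floor, because loading is cheaper than recomputing, while an I/O-bound tier pushes the ratio up until the compute and load times roughly balance. A real system would also have to account for imperfect overlap, per-layer pipelining, and measured rather than assumed throughput.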
Problem

Research questions and friction points this paper is trying to address.

KV Cache reuse
long-context LLM
Time-To-First-Token
non-prefix caching
I/O overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

KV Cache reuse
long-context LLM
semantic-consistency recovery
compute-I/O co-optimization
adaptive recomputation
🔎 Similar Papers
2024-05-26, Proceedings of the Twentieth European Conference on Computer Systems, Citations: 7
Fei Li
Associate Professor of Computer Science, George Mason University
Algorithms, Scheduling
Song Liu
School of Computer Science and Technology, Xi’an Jiaotong University
Yan Liu
School of Computer Science and Technology, Xi’an Jiaotong University
Jinhua Cui
School of Computer Science and Technology, Huazhong University of Science and Technology
Shiqiang Nie
School of Computer Science and Technology, Xi’an Jiaotong University
Jinyu Wang
School of Computer Science and Technology, Xi’an Jiaotong University
Weiguo Wu
School of Computer Science and Technology, Xi’an Jiaotong University