KV Packet: Recomputation-Free Context-Independent KV Caching for LLMs

📅 2026-04-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses an inefficiency of standard key-value (KV) caching in large language models: because cached states are context-dependent, reusing them in a new context requires recomputation, which adds computational overhead and latency. To overcome this limitation, the authors propose the KV Packet framework, which treats cached KV states as immutable “packets” and wraps them in lightweight, trainable soft-token adapters that bridge context discontinuities. The adapters are trained via self-supervised distillation to correct shifts in attention distribution, so cached documents can be reused across contexts without recomputation. On Llama-3.1 and Qwen2.5, the method achieves near-zero FLOPs and lower time-to-first-token (TTFT) than recomputation-based baselines, while maintaining F1 scores comparable to those of the full recomputation baseline.

📝 Abstract

Large Language Models (LLMs) rely heavily on Key-Value (KV) caching to minimize inference latency. However, standard KV caches are context-dependent: reusing a cached document in a new context requires recomputing KV states to account for shifts in attention distribution. Existing solutions such as CacheBlend, EPIC, and SAM-KV mitigate this issue by selectively recomputing a subset of tokens, but they still incur non-negligible computational overhead (FLOPs) and increased Time-to-First-Token (TTFT) latency. In this paper, we propose KV Packet, a recomputation-free cache reuse framework that treats cached documents as immutable “packets” wrapped in lightweight trainable soft-token adapters, which are trained via self-supervised distillation to bridge context discontinuities. Experiments on Llama-3.1 and Qwen2.5 demonstrate that the proposed KV Packet method achieves near-zero FLOPs and lower TTFT than recomputation-based baselines, while retaining F1 scores comparable to those of the full recomputation baseline.
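
The packet idea described above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the packet layout (soft-token KV placed before and after the document KV), the names `make_packet`, `assemble_cache`, and `attend`, the head dimension, and the random values are all assumptions made for illustration. Positional encoding and the distillation training of the adapters are left out. The point it shows is only that assembling a new context is a concatenation of stored arrays, with no recomputation of the cached document states.

```python
import numpy as np

D = 8  # hypothetical attention head dimension


def make_packet(rng, doc_len, n_soft=2):
    """Build a toy packet: frozen document KV wrapped in soft-token KV.

    The head/tail layout is an assumption for this sketch, not a detail
    taken from the paper.
    """
    doc_k = rng.standard_normal((doc_len, D))
    doc_v = rng.standard_normal((doc_len, D))
    # The cached document states are immutable: mark them read-only.
    doc_k.setflags(write=False)
    doc_v.setflags(write=False)
    return {
        "doc_k": doc_k,
        "doc_v": doc_v,
        # Adapter states: the only part that would be trained.
        "head_k": rng.standard_normal((n_soft, D)),
        "head_v": rng.standard_normal((n_soft, D)),
        "tail_k": rng.standard_normal((n_soft, D)),
        "tail_v": rng.standard_normal((n_soft, D)),
    }


def assemble_cache(packets):
    """Concatenate packets into one KV cache without touching document KV."""
    ks, vs = [], []
    for p in packets:
        ks += [p["head_k"], p["doc_k"], p["tail_k"]]
        vs += [p["head_v"], p["doc_v"], p["tail_v"]]
    return np.concatenate(ks), np.concatenate(vs)


def attend(q, k, v):
    """Single-head scaled dot-product attention over the assembled cache."""
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


rng = np.random.default_rng(0)
packets = [make_packet(rng, doc_len=5), make_packet(rng, doc_len=3)]
k, v = assemble_cache(packets)
q = rng.standard_normal((1, D))  # query for the first generated token
out = attend(q, k, v)
```

In this toy setup the same packet could be placed in any order or alongside any other packet, and the document arrays are copied into the cache unchanged; any context-specific correction would have to come from the adapter states alone.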

Problem

Research questions and friction points this paper is trying to address.

KV caching

context-dependent

recomputation

inference latency

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

KV caching

recomputation-free

soft-token adapters