🤖 AI Summary
This work addresses an inefficiency of standard key-value (KV) caching in large language models: because cached states are context-dependent, reusing them in a new context requires recomputation, which adds significant computational overhead and latency. To overcome this limitation, the authors propose the KV Packet framework, which encapsulates cached KV states into immutable “packets” and introduces lightweight, trainable soft-token adapters to bridge contextual discontinuities. This approach enables, for the first time, fully recomputation-free, context-agnostic reuse of KV caches. The adapters are trained with a self-supervised distillation strategy that corrects attention distribution shifts. On Llama-3.1 and Qwen2.5, the method achieves near-zero FLOPs overhead and lower time-to-first-token latency while maintaining F1 scores comparable to those of the full recomputation baseline.
📝 Abstract
Large Language Models (LLMs) rely heavily on Key-Value (KV) caching to minimize inference latency. However, standard KV caches are context-dependent: reusing a cached document in a new context requires recomputing KV states to account for shifts in attention distribution. Existing solutions such as CacheBlend, EPIC, and SAM-KV mitigate this issue by selectively recomputing a subset of tokens; however, they still incur non-negligible computational overhead (FLOPs) and increased Time-to-First-Token (TTFT). In this paper, we propose KV Packet, a recomputation-free cache reuse framework that treats cached documents as immutable “packets” wrapped in lightweight trainable soft-token adapters, which are trained via self-supervised distillation to bridge context discontinuities. Experiments on Llama-3.1 and Qwen2.5 demonstrate that the proposed KV Packet method achieves near-zero FLOPs and lower TTFT than recomputation-based baselines, while retaining F1 scores comparable to those of the full recomputation baseline.
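The idea of reusing cached documents as immutable packets wrapped in soft-token adapters can be illustrated with a minimal, single-head sketch. This is not the authors' implementation: the names `KVPacket`, `SoftTokenAdapter`, `wrap`, `assemble`, and `attend` are hypothetical, the placement of adapter tokens before and after each packet is an assumption, and adapter training, positional encoding, multiple heads, and multiple layers are all omitted. The sketch only shows that a new context can be assembled by concatenation, without modifying or recomputing any cached document state.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Matrix = Tuple[Tuple[float, ...], ...]


@dataclass(frozen=True)
class KVPacket:
    """Precomputed key/value states for one document (hypothetical layout)."""
    keys: Matrix
    values: Matrix


@dataclass(frozen=True)
class SoftTokenAdapter:
    """Key/value states of soft tokens placed around a packet.

    Assumption: in a real system these would be trained; here they are fixed.
    """
    head: KVPacket
    tail: KVPacket


def wrap(packet: KVPacket, adapter: SoftTokenAdapter) -> KVPacket:
    """Surround a packet with adapter states; the packet itself is untouched."""
    return KVPacket(
        keys=adapter.head.keys + packet.keys + adapter.tail.keys,
        values=adapter.head.values + packet.values + adapter.tail.values,
    )


def assemble(packets: Sequence[KVPacket], adapter: SoftTokenAdapter) -> KVPacket:
    """Build a context cache by concatenating wrapped packets (no recomputation)."""
    keys: Matrix = ()
    values: Matrix = ()
    for packet in packets:
        wrapped = wrap(packet, adapter)
        keys += wrapped.keys
        values += wrapped.values
    return KVPacket(keys, values)


def attend(query: Sequence[float], cache: KVPacket) -> Tuple[List[float], List[float]]:
    """Scaled dot-product attention of one query over the assembled cache."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in cache.keys]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    width = len(cache.values[0])
    output = [sum(w * v[j] for w, v in zip(weights, cache.values)) for j in range(width)]
    return output, weights


# Two cached documents and one shared adapter (toy values, for illustration only).
doc_a = KVPacket(keys=((1.0, 0.0), (0.0, 1.0)), values=((1.0, 2.0), (3.0, 4.0)))
doc_b = KVPacket(keys=((0.5, 0.5),), values=((5.0, 6.0),))
adapter = SoftTokenAdapter(
    head=KVPacket(keys=((0.1, 0.1),), values=((0.0, 0.0),)),
    tail=KVPacket(keys=((0.2, 0.2),), values=((0.0, 0.0),)),
)

cache = assemble([doc_a, doc_b], adapter)
output, weights = attend([1.0, 0.0], cache)
```

Because the packets are frozen dataclasses holding tuples, assembling a new context cannot alter a cached document; the only context-specific states in this sketch are the adapter tokens.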