Towards More Economical Context-Augmented LLM Generation by Reusing Stored KV Cache

📅 2025-03-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high prefill latency and cloud cost of long-context large language model (LLM) inference. It proposes a validated analytical model of public-cloud cost for evaluating the economics of KV cache reuse across compute, storage, and network. The model relates workload parameters, including reuse frequency, generation length, and model size, to the cloud cost of storing and reusing KV caches. Preliminary results show that, across a range of long-context workloads, KV cache reuse reduces both delay and cloud cost, so the delay savings need not come at a higher price. The study is a starting point for building more economical context-augmented LLM services.

📝 Abstract

Across large language model (LLM) applications, we observe an emerging trend of reusing KV caches to avoid the prefill delay of processing input text that repeats across different LLM inputs. This has led to a broad design space, ranging from colocating stored KV caches with (or close to) GPUs to various KV cache compression techniques. However, a key question remains unanswered: can these delay reductions also be economically favorable? Specifically, we ask whether a developer can use public cloud services to store precomputed KV caches and reuse them to save delay without incurring higher costs in compute, storage, and network. To answer this question, we propose a validated analytical model for the cloud cost (in compute, storage, and network) of storing and reusing KV caches, based on workload parameters such as reuse frequency, generated text length, and model size. Preliminary results show that KV cache reuse can save both delay and cloud cost across a range of workloads with long contexts. We call for more effort on building more economical context-augmented LLM systems through KV cache reuse.
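To make the compute, storage, and network trade-off concrete, here is a minimal sketch of a cost comparison in Python. It is not the paper's model: the cost formulas, the `Workload` and `Prices` types, and every parameter name are illustrative assumptions. The only standard piece is the KV cache size, which for a transformer is 2 (keys and values) × layers × KV heads × head dimension × bytes per element × tokens.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    # All fields are hypothetical workload parameters, not taken from the paper.
    context_tokens: int          # length of the repeated context
    reuses_per_hour: float       # how often the same context is requested
    hours: float                 # how long the KV cache is kept in storage
    prefill_tokens_per_s: float  # assumed prefill throughput of one GPU
    decode_s: float              # assumed GPU seconds spent generating per request


@dataclass
class Prices:
    # Hypothetical cloud prices in dollars.
    gpu_per_hour: float
    storage_per_gb_hour: float
    network_per_gb: float


def kv_cache_gb(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Size of the KV cache in GB (keys and values for every layer and token)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens / 1e9


def cost_recompute(w, p):
    """Cloud cost when every request prefills the full context again."""
    n_requests = w.reuses_per_hour * w.hours
    gpu_s = n_requests * (w.context_tokens / w.prefill_tokens_per_s + w.decode_s)
    return gpu_s / 3600 * p.gpu_per_hour


def cost_reuse(w, p, kv_gb, load_gb_per_s):
    """Cloud cost when the context is prefilled once and its KV cache is reloaded.

    Assumes the GPU is billed while the cache loads (a simplification).
    """
    n_requests = w.reuses_per_hour * w.hours
    prefill_once_s = w.context_tokens / w.prefill_tokens_per_s
    gpu_s = prefill_once_s + n_requests * (kv_gb / load_gb_per_s + w.decode_s)
    compute = gpu_s / 3600 * p.gpu_per_hour
    storage = kv_gb * p.storage_per_gb_hour * w.hours
    network = n_requests * kv_gb * p.network_per_gb
    return compute + storage + network
```

Under these assumptions, reuse pays off when the context is requested often enough that the saved prefill time outweighs the storage bill and the cache loading time. A rarely reused context can cost more to store than to recompute.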

Problem

Research questions and friction points this paper is trying to address.

Evaluate economic feasibility of reusing KV caches in LLMs.

Develop model to analyze cloud costs for KV cache storage.

Assess cost savings from reusing KV caches in cloud services.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Reuses stored KV caches to reduce prefill delay.

Proposes a cost model for cloud-based KV cache storage and reuse.

Demonstrates delay and cost savings in long-context LLM workloads.
