🤖 AI Summary
Existing file-system-based KV caches suffer from high metadata overhead, low I/O efficiency, and poor spatial locality, severely limiting scalability of KV caching in LLM inference and exacerbating time-to-first-token (TTFT). This paper pioneers the systematic adoption of the LSM-tree storage architecture for large-scale LLM KV cache management, proposing SGLANG-LSM. Its core contributions are: (1) a prefix-preserving key-value separation storage engine; (2) an adaptive configuration tuning controller; and (3) a lightweight runtime service. By leveraging batched writes, dynamic resource scheduling, and log-structured management, SGLANG-LSM achieves up to a 143% improvement in cache hit rate and a 24% reduction in TTFT under highly dynamic workloads—significantly outperforming state-of-the-art approaches.
📝 Abstract
Large language models (LLMs) rely on Key-Value (KV) cache to reduce time-to-first-token (TTFT) latency, but existing disk-based KV cache systems using file-per-object layouts suffer from severe scalability bottlenecks due to file system metadata overhead, I/O inefficiency, and poor spatial locality. This paper presents SGLANG-LSM, a database-inspired system that leverages Log-Structured Merge-tree (LSM-tree) architectures for scalable KV cache management. SGLANG-LSM implements a layered system design with three coordinated components: (1) a prefix-preserving storage engine that maintains token sequence locality while efficiently storing large KV cache tensors through key-value separation, (2) an adaptive controller that dynamically optimizes LSM-tree configurations based on shifting workload characteristics, and (3) runtime services including batch operations and automatic resource management for production deployment. Evaluation on large-scale dynamic workloads demonstrates that SGLANG-LSM significantly improves cache hits by up to 143% and reduces TTFT by up to 24% compared to state-of-the-art systems, representing the first systematic application of database storage architectures to large-scale LLM cache management.
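To make the first component concrete, here is a minimal sketch of how prefix-preserving keys can be combined with key-value separation. All names (`PrefixKVStore`, `ValueLog`, etc.) are hypothetical illustrations, not the paper's actual API: token-ID sequences are encoded as fixed-width big-endian bytes so that lexicographic key order matches prefix order, and large KV-cache tensors live in an append-only value log (WiscKey-style), with the index holding only small (offset, length) pointers.

```python
import struct

def prefix_key(token_ids):
    # Fixed-width big-endian encoding: byte-wise lexicographic order
    # matches token-sequence prefix order, so requests sharing a prompt
    # prefix sort adjacently in the LSM-tree's sorted key space.
    return b"".join(struct.pack(">I", t) for t in token_ids)

class ValueLog:
    """Append-only log for large KV-cache tensor blobs; the index
    keeps only (offset, length) pointers (key-value separation)."""
    def __init__(self):
        self.buf = bytearray()

    def append(self, blob):
        off = len(self.buf)
        self.buf.extend(blob)
        return off, len(blob)

    def read(self, off, length):
        return bytes(self.buf[off:off + length])

class PrefixKVStore:
    """Toy stand-in for the storage engine: a dict plays the role of
    the LSM-tree index; a real engine would use sorted SSTables."""
    def __init__(self):
        self.index = {}
        self.vlog = ValueLog()

    def put(self, token_ids, tensor_bytes):
        self.index[prefix_key(token_ids)] = self.vlog.append(tensor_bytes)

    def get(self, token_ids):
        ptr = self.index.get(prefix_key(token_ids))
        return self.vlog.read(*ptr) if ptr else None

    def longest_prefix_hit(self, token_ids):
        # Scan from longest to shortest prefix; an LSM-tree would answer
        # this with a bounded seek over adjacent keys instead.
        for n in range(len(token_ids), 0, -1):
            blob = self.get(token_ids[:n])
            if blob is not None:
                return n, blob
        return 0, None
```

Because `prefix_key([1, 2])` is a byte-prefix of `prefix_key([1, 2, 3])`, cached prefixes of the same prompt land next to each other on disk, which is the spatial-locality property the file-per-object layout loses.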