🤖 AI Summary
Existing file-system-based KV caches suffer from high metadata overhead, low I/O efficiency, and poor spatial locality, severely limiting scalability of KV caching in LLM inference and exacerbating time-to-first-token (TTFT). This paper pioneers the systematic adoption of the LSM-tree storage architecture for large-scale LLM KV cache management, proposing SGLANG-LSM. Its core contributions are: (1) a prefix-preserving key-value separation storage engine; (2) an adaptive configuration tuning controller; and (3) a lightweight runtime service. By leveraging batched writes, dynamic resource scheduling, and log-structured management, SGLANG-LSM achieves up to a 143% improvement in cache hit rate and a 24% reduction in TTFT under highly dynamic workloads—significantly outperforming state-of-the-art approaches.
📝 Abstract
Large language models (LLMs) rely on Key-Value (KV) cache to reduce time-to-first-token (TTFT) latency, but existing disk-based KV cache systems using file-per-object layouts suffer from severe scalability bottlenecks due to file system metadata overhead, I/O inefficiency, and poor spatial locality. This paper presents SGLANG-LSM, a database-inspired system that leverages Log-Structured Merge-tree (LSM-tree) architectures for scalable KV cache management. SGLANG-LSM implements a layered system design with three coordinated components: (1) a prefix-preserving storage engine that maintains token sequence locality while efficiently storing large KV cache tensors through key-value separation, (2) an adaptive controller that dynamically optimizes LSM-tree configurations based on shifting workload characteristics, and (3) runtime services including batch operations and automatic resource management for production deployment. Evaluation on large-scale dynamic workloads demonstrates that SGLANG-LSM significantly improves cache hits by up to 143% and reduces TTFT by up to 24% compared to state-of-the-art systems, representing the first systematic application of database storage architectures to large-scale LLM cache management.
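To make the first component concrete, here is a minimal sketch of how prefix-preserving keys can be combined with key-value separation. All names (`PrefixKVStore`, `ValueLog`, etc.) are hypothetical illustrations, not the paper's actual API: token-ID sequences are encoded as fixed-width big-endian bytes so that lexicographic key order matches prefix order, and large KV-cache tensors live in an append-only value log (WiscKey-style), with the index holding only small (offset, length) pointers.

```python
import struct

def prefix_key(token_ids):
    # Fixed-width big-endian encoding: byte-wise lexicographic order
    # matches token-sequence prefix order, so requests sharing a prompt
    # prefix sort adjacently in the LSM-tree's sorted key space.
    return b"".join(struct.pack(">I", t) for t in token_ids)

class ValueLog:
    """Append-only log for large KV-cache tensor blobs; the index
    keeps only (offset, length) pointers (key-value separation)."""
    def __init__(self):
        self.buf = bytearray()

    def append(self, blob):
        off = len(self.buf)
        self.buf.extend(blob)
        return off, len(blob)

    def read(self, off, length):
        return bytes(self.buf[off:off + length])

class PrefixKVStore:
    """Toy stand-in for the storage engine: a dict plays the role of
    the LSM-tree index; a real engine would use sorted SSTables."""
    def __init__(self):
        self.index = {}
        self.vlog = ValueLog()

    def put(self, token_ids, tensor_bytes):
        self.index[prefix_key(token_ids)] = self.vlog.append(tensor_bytes)

    def get(self, token_ids):
        ptr = self.index.get(prefix_key(token_ids))
        return self.vlog.read(*ptr) if ptr else None

    def longest_prefix_hit(self, token_ids):
        # Scan from longest to shortest prefix; an LSM-tree would answer
        # this with a bounded seek over adjacent keys instead.
        for n in range(len(token_ids), 0, -1):
            blob = self.get(token_ids[:n])
            if blob is not None:
                return n, blob
        return 0, None
```

Because `prefix_key([1, 2])` is a byte-prefix of `prefix_key([1, 2, 3])`, cached prefixes of the same prompt land next to each other on disk, which is the spatial-locality property the file-per-object layout loses.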