🤖 AI Summary
This work addresses the memory bottleneck in long-context large language model (LLM) inference, where the KV cache outgrows GPU HBM and CPU DRAM and must be offloaded to NVMe SSDs. Existing approaches suffer severe GPU stalls because fragmented GPU memory layouts produce many tiny random I/Os and the CPU remains on the critical path, even with GPUDirect Storage (GDS). To overcome this, the authors propose Tutti, a GPU-centric, SSD-backed KV cache object store that removes the CPU from the critical data and I/O control paths through a GPU-native object abstraction, a GPU io_uring for asynchronous direct object I/O, and slack-aware I/O scheduling; the CPU only loads I/O kernels to the GPU asynchronously, once per layer. Integrated into vLLM, Tutti saturates NVMe SSD bandwidth with near-zero GPU stalls. Compared with the GDS-enabled, SSD-backed LMCache, it reduces time-to-first-token (TTFT) latency by 78.3% under strict SLO constraints, doubles the achievable request rate, and lowers serving cost by 27%, while nearly matching DRAM-backed LMCache performance and offering virtually unbounded cache capacity.
📝 Abstract
LLM serving relies on prefix caching to improve inference performance. As growing contexts push the key-value (KV) cache footprint far beyond GPU HBM and CPU DRAM capacity, the KV cache is increasingly offloaded to NVMe SSDs. Unfortunately, restoring the KV cache from SSDs suffers from poor I/O performance and incurs significant GPU stalls. The primary cause is that the fragmented GPU memory layout results in a massive number of tiny random I/Os, which makes the low-parallelism CPU a severe bottleneck. This holds even with GPUDirect Storage (GDS), which still relies on CPU intervention to initiate each I/O and thus remains CPU-centric. This paper presents Tutti, an efficient SSD-backed KV caching solution that eliminates CPU intervention from the critical data and I/O control paths between HBM and SSDs. At the core of Tutti is a GPU-centric KV cache object store, in which the CPU is only responsible for asynchronously loading I/O kernels to the GPU once per layer. Tutti saturates NVMe SSD bandwidth and reduces GPU stalls to near zero through the following designs: (i) we provide a GPU-native object abstraction that enables bulk KV cache transfers and management; (ii) we re-architect the GPU storage stack by introducing GPU io_uring to support asynchronous GPU direct object I/O; and (iii) we propose slack-aware I/O scheduling to avoid GPU resource contention. We have implemented Tutti and integrated it into vLLM. Extensive evaluation shows that, compared to the state-of-the-art GDS-enabled, SSD-backed LMCache, Tutti reduces TTFT by 78.3% under strict SLO constraints, improves the achievable request rate by 2×, and reduces the serving cost by 27%. Tutti achieves nearly the same inference performance as DRAM-backed LMCache, while providing almost infinite capacity.
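To give a feel for why a fragmented layout produces "a massive number of tiny random I/Os", the following back-of-the-envelope sketch counts the I/O requests needed to restore one cached prefix. Everything in it is an illustrative assumption rather than a figure from the paper: the model shape, the 16-token block size, the premise that each block's K and V tensors are separate contiguous chunks in each layer, and the choice of one bulk object per layer for comparison.

```python
import math

def kv_bytes_per_token_per_layer(num_kv_heads, head_dim, dtype_bytes=2):
    # K and V each hold num_kv_heads * head_dim values per token.
    return 2 * num_kv_heads * head_dim * dtype_bytes

def fragmented_io_count(num_tokens, num_layers, block_tokens):
    # Assumed layout: every (layer, block, K-or-V) chunk is contiguous
    # only within itself, so each chunk needs its own I/O request.
    blocks = math.ceil(num_tokens / block_tokens)
    return blocks * num_layers * 2

def fragmented_io_bytes(num_kv_heads, head_dim, block_tokens, dtype_bytes=2):
    # Size of one such chunk: K (or V) for block_tokens tokens in one layer.
    return block_tokens * num_kv_heads * head_dim * dtype_bytes

def bulk_io_count(num_layers):
    # Assumed comparison point: one contiguous object per layer.
    return num_layers

# Hypothetical configuration (not taken from the paper).
NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM = 32, 8, 128
BLOCK_TOKENS, PREFIX_TOKENS = 16, 32768

total_bytes = (PREFIX_TOKENS * NUM_LAYERS
               * kv_bytes_per_token_per_layer(NUM_KV_HEADS, HEAD_DIM))
small_ios = fragmented_io_count(PREFIX_TOKENS, NUM_LAYERS, BLOCK_TOKENS)
small_io_size = fragmented_io_bytes(NUM_KV_HEADS, HEAD_DIM, BLOCK_TOKENS)
bulk_ios = bulk_io_count(NUM_LAYERS)
bulk_io_size = total_bytes // bulk_ios
```

Under these assumptions, a 4 GiB prefix breaks into 131,072 requests of 32 KiB each, versus 32 requests of 128 MiB each when every layer is transferred as one object. Issuing the former from a CPU thread, one request at a time, is the kind of bottleneck the abstract describes.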