🤖 AI Summary
This work addresses the significant GPU memory overhead of KV caches in large language model inference and the limited scalability of existing SSD-based offloading approaches, which are bottlenecked by PCIe bandwidth and single-device throughput. The authors identify and exploit the co-activation property of KV cache entries, the tendency of certain keys and values to be accessed together, and propose a novel multi-SSD cooperative offloading architecture. Offline graph clustering and multi-SSD-aware graph partitioning distribute highly correlated KV items across multiple SSDs, with selective replication enabling parallel I/O. A runtime mechanism dynamically refines clustering and caching policies to sustain high bandwidth utilization. Experimental results demonstrate that the proposed approach reduces I/O latency by 2.41× and improves effective bandwidth utilization by 2.72× compared to prior methods.
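The offline stage described above can be sketched in a few lines: build a co-activation graph from access traces, merge strongly co-activated entries into clusters, then spread clusters across SSDs while replicating the hottest ones. This is a minimal illustration under assumed interfaces, not the paper's actual algorithm; the trace format, the union-find clustering, and the greedy least-loaded placement are all simplifications of what the paper describes as graph clustering and graph-based placement.

```python
from collections import defaultdict
from itertools import combinations

def build_coactivation_graph(traces):
    """Edge weight = number of requests in which two KV entries co-occur."""
    weights = defaultdict(int)
    for request in traces:
        for a, b in combinations(sorted(set(request)), 2):
            weights[(a, b)] += 1
    return weights

def cluster_entries(weights, threshold=2):
    """Union-find merge of entries whose co-activation weight meets the threshold."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for (a, b), count in weights.items():
        if count >= threshold:
            parent[find(a)] = find(b)
    clusters = defaultdict(set)
    for entry in list(parent):
        clusters[find(entry)].add(entry)
    return list(clusters.values())

def place_clusters(clusters, hotness, num_ssds, replicate_top=1):
    """Greedy least-loaded placement; the hottest clusters are replicated on
    every SSD so reads for them can be served from any device."""
    ssds = [set() for _ in range(num_ssds)]
    load = [0] * num_ssds
    for rank, cluster in enumerate(sorted(clusters, key=hotness, reverse=True)):
        if rank < replicate_top:          # selective replication of hot clusters
            for i in range(num_ssds):
                ssds[i] |= cluster
                load[i] += hotness(cluster)
        else:                             # least-loaded SSD takes the cluster
            i = load.index(min(load))
            ssds[i] |= cluster
            load[i] += hotness(cluster)
    return ssds
```

Any clustering (e.g. community detection) and any partitioner (e.g. balanced graph cuts) could stand in for the union-find and greedy steps here; the point is only that co-accessed entries end up co-located, and hot clusters end up on multiple devices.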
📝 Abstract
The key-value cache (KVCache) has become the dominant contributor to memory consumption in large language model (LLM) inference. Although offloading the KVCache from GPU high-bandwidth memory (HBM) to CPU DRAM alleviates device memory pressure, DRAM remains capacity-limited and costly for large, persistent workloads. Solid-state drives (SSDs) provide a cost-effective alternative, but naive SSD-based paging is fundamentally bandwidth-bound due to limited PCIe throughput and per-device bandwidth constraints.
In this paper, we observe that KVCache activations in real-world workloads exhibit strong and stable correlations. We term this phenomenon KVCache Co-Activation, where accessing a KV entry is often accompanied by a stable and recurring set of other KV entries. Leveraging this property, we present Swarm, an SSD-based KVCache offloading system that converts bandwidth-bound single-device access into parallel I/O across multiple SSDs. Specifically, Swarm clusters co-activated KV entries offline and distributes the resulting clusters across SSDs using graph-based placement with selective replication to maximize parallel I/O bandwidth. At runtime, Swarm performs load-balanced cluster retrieval and dynamically adapts clustering and caching decisions to sustain high bandwidth utilization under evolving access patterns. Evaluations show that Swarm reduces I/O time by 2.41× and improves effective bandwidth utilization by 2.72×.
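The runtime side of the idea, load-balanced cluster retrieval, can be illustrated by sending each requested cluster to the least-busy SSD holding a replica and issuing the reads concurrently. A minimal sketch, where the `placement` map and the `read_fn` callback are hypothetical stand-ins rather than Swarm's actual interface:

```python
import threading
from collections import defaultdict

def fetch_clusters(requested, placement, read_fn):
    """Plan reads so replicas spread load across SSDs, then issue them in parallel.

    requested : cluster IDs needed for the current inference step
    placement : cluster ID -> list of SSD indices holding a replica
    read_fn(ssd, cluster_id) : performs the actual device read
    """
    load = defaultdict(int)
    plan = []
    for cid in requested:
        ssd = min(placement[cid], key=lambda s: load[s])  # least-loaded replica
        load[ssd] += 1
        plan.append((ssd, cid))
    # One thread per read for illustration; a real system would drive
    # per-device I/O queues (e.g. async I/O) instead.
    threads = [threading.Thread(target=read_fn, args=(s, c)) for s, c in plan]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return plan
```

Replication is what makes the balancing step meaningful: a cluster present on only one SSD has a forced destination, while a replicated hot cluster can be steered to whichever device currently has slack.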