Swarm: Co-Activation Aware KVCache Offloading Across Multiple SSDs

📅 2026-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the significant GPU memory overhead of KV caches in large language model inference and the limited scalability of existing SSD-based offloading approaches, which are bottlenecked by PCIe bandwidth and single-device throughput. The authors identify and leverage the co-activation property of KV cache entries—where certain keys and values are frequently accessed together—and propose a novel multi-SSD cooperative offloading architecture. By applying offline graph clustering and multi-SSD-aware graph partitioning, highly correlated KV items are distributed across multiple SSDs, enabling parallel I/O through selective replication. A runtime mechanism dynamically refines clustering and caching policies to sustain high bandwidth utilization. Experimental results demonstrate that the proposed approach reduces I/O latency by 2.41× and improves effective bandwidth utilization by 2.72× compared to prior methods.
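The offline stage described above (co-activation graph construction, clustering, and multi-SSD placement with selective replication) can be sketched roughly as follows. This is a minimal illustration of the general idea, not the authors' implementation: the greedy edge-merging clusterer, the round-robin placement, and the "replicate the hottest clusters" heuristic are all assumptions standing in for the paper's graph clustering and multi-SSD-aware partitioning.

```python
# Hypothetical sketch of a co-activation-aware offline placement stage.
# All function names and heuristics here are illustrative assumptions,
# not Swarm's actual algorithms.
from collections import defaultdict
from itertools import combinations

def build_coactivation_graph(trace):
    """trace: list of requests, each a set of KV-entry ids accessed together.
    Returns edge weights counting how often two entries co-activate."""
    weight = defaultdict(int)
    for request in trace:
        for a, b in combinations(sorted(request), 2):
            weight[(a, b)] += 1
    return weight

def greedy_cluster(weight, max_cluster_size):
    """Merge entries along the heaviest co-activation edges first
    (union-find with path halving), capping cluster size."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    size = defaultdict(lambda: 1)
    for (a, b), _ in sorted(weight.items(), key=lambda kv: -kv[1]):
        ra, rb = find(a), find(b)
        if ra != rb and size[ra] + size[rb] <= max_cluster_size:
            parent[rb] = ra
            size[ra] += size[rb]
    clusters = defaultdict(list)
    for x in parent:
        clusters[find(x)].append(x)
    return list(clusters.values())

def place_clusters(clusters, num_ssds, replicate_top=1):
    """Spread clusters across SSDs round-robin; replicate the largest
    clusters onto a second SSD so hot reads can be served in parallel."""
    by_size = sorted(clusters, key=len, reverse=True)
    placement = defaultdict(list)  # ssd id -> list of clusters stored there
    for i, cluster in enumerate(by_size):
        primary = i % num_ssds
        placement[primary].append(cluster)
        if i < replicate_top and num_ssds > 1:
            placement[(primary + 1) % num_ssds].append(cluster)
    return placement
```

Given a trace where entries 1–3 and 4–6 each co-activate, the clusterer recovers the two groups and the placement spreads them over two SSDs, with the hottest cluster mirrored.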

📝 Abstract
The key-value (KV) cache has become the dominant contributor to memory consumption in large language model (LLM) inference. Although offloading KVCache from GPU high-bandwidth memory (HBM) to CPU DRAM alleviates device memory pressure, DRAM remains capacity-limited and costly for large, persistent workloads. Solid-state drives (SSDs) provide a cost-effective alternative, but naive SSD-based paging is fundamentally bandwidth-bound due to limited PCIe throughput and per-device bandwidth constraints. In this paper, we observe that KVCache activations in real-world workloads exhibit strong and stable correlations. We term this phenomenon KVCache Co-Activation, where accessing a KV entry is often accompanied by a stable and recurring set of other KV entries. Leveraging this property, we present Swarm, an SSD-based KVCache offloading system that converts bandwidth-bound single-device access into parallel I/O across multiple SSDs. Specifically, Swarm clusters co-activated KV entries offline and distributes the resulting clusters across SSDs using graph-based placement with selective replication to maximize parallel I/O bandwidth. At runtime, Swarm performs load-balanced cluster retrieval and dynamically adapts clustering and caching decisions to sustain high bandwidth utilization under evolving access patterns. Evaluations show that Swarm reduces I/O time by 2.41x and improves effective bandwidth utilization by 2.72x.
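The runtime side of the abstract, load-balanced cluster retrieval across replicas, can be sketched as below. This is a hedged illustration under assumed interfaces: the replica map, the least-loaded replica choice, and the `read_fn` callback are inventions for this example, not the paper's API.

```python
# Hypothetical sketch of load-balanced parallel cluster retrieval.
# The replica-selection heuristic and interfaces are assumptions for
# illustration, not Swarm's runtime implementation.
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def fetch_clusters(needed, replicas, read_fn):
    """Fetch each needed cluster from its least-loaded replica, in parallel.

    needed:   iterable of cluster ids to retrieve.
    replicas: cluster id -> list of SSD ids holding a copy.
    read_fn:  blocking read of one cluster from one SSD, read_fn(ssd, cluster).
    """
    load = defaultdict(int)
    plan = []
    for cluster in needed:
        # Greedily send each read to the replica SSD with the fewest
        # reads assigned so far, spreading I/O across devices.
        ssd = min(replicas[cluster], key=lambda s: load[s])
        load[ssd] += 1
        plan.append((ssd, cluster))
    with ThreadPoolExecutor(max_workers=len(plan) or 1) as pool:
        return list(pool.map(lambda sc: read_fn(*sc), plan))
```

The thread pool issues the per-SSD reads concurrently, which is what converts a single-device bandwidth bound into aggregate multi-SSD throughput; a production system would issue asynchronous I/O rather than blocking threads.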
Problem

Research questions and friction points this paper is trying to address.

KVCache
offloading
SSD
bandwidth bottleneck
large language model
Innovation

Methods, ideas, or system contributions that make the work stand out.

KVCache Co-Activation
Multi-SSD Offloading
Parallel I/O
Graph-based Placement
Dynamic Clustering
Tuowei Wang
Tsinghua University
Liyun Chu
Tsinghua University
Ruwen Fan
Tsinghua University
Ju Ren
Department of Computer Science and Technology, Tsinghua University
Internet of Things, Edge Computing/Intelligence, Security and Privacy