DUAL-BLADE: Dual-Path NVMe-Direct KV-Cache Offloading for Edge LLM Inference

📅 2026-04-29

🤖 AI Summary
This work addresses KV cache memory overflow when deploying large language models on edge devices. Existing NVMe offloading schemes are file-based and depend on the kernel page cache, so under memory pressure they suffer from cache thrashing, unpredictable latency, and high software overhead. To overcome these limitations, the authors propose DUAL-BLADE, a dual-path KV cache residency framework that routes each KV tensor, based on runtime memory availability, either to the page cache or directly to NVMe via a path that bypasses the file system. The NVMe-direct path maps KV tensors to contiguous logical block address (LBA) regions, and adaptive pipeline parallelism overlaps storage I/O with GPU DMA transfers. The result is memory-aware, dynamic KV cache placement that reduces offloading overhead: latency drops by up to 33.1% in the prefill phase and 42.4% in the decode phase, and SSD utilization improves by 2.2× across diverse memory budgets.
📝 Abstract
The increasing deployment of Large Language Model (LLM) inference on edge AI systems demands efficient execution under tight memory budgets. A key challenge arises from Key-Value (KV) caches, which often exceed available device memory. Although NVMe-based offloading offers scalable capacity, existing file-based designs rely heavily on the kernel page cache, leading to cache thrashing, unpredictable latency, and high software overhead under memory pressure. We present DUAL-BLADE, a dual-path KV residency framework that dynamically assigns KV tensors to either a page-cache path or an NVMe-direct path based on runtime memory availability. The NVMe-direct path bypasses the filesystem by mapping KV tensors to contiguous logical block address (LBA) regions, enabling low-overhead direct storage access. DUAL-BLADE further incorporates adaptive pipeline parallelism to overlap storage I/O with GPU DMA, improving inference throughput. Our evaluation shows that DUAL-BLADE substantially mitigates I/O bottlenecks, reducing prefill and decode latency by up to 33.1% and 42.4%, respectively, while improving SSD utilization by 2.2x across diverse memory budgets.
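The dual-path placement described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the routing policy or the allocator, so the headroom threshold (`reserve_bytes`), the 4 KiB block size, the bump-style `LbaAllocator`, and all class and method names below are assumptions made purely for illustration. The sketch only tracks placement decisions and performs no real I/O.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

BLOCK_SIZE = 4096  # assumed logical block size in bytes (hypothetical)


@dataclass
class LbaAllocator:
    """Hypothetical bump allocator that hands out contiguous LBA ranges."""
    next_lba: int = 0
    capacity_blocks: int = 1 << 20

    def allocate(self, nbytes: int) -> Tuple[int, int]:
        # Round the tensor size up to whole logical blocks.
        nblocks = -(-nbytes // BLOCK_SIZE)
        if self.next_lba + nblocks > self.capacity_blocks:
            raise MemoryError("LBA region exhausted")
        start = self.next_lba
        self.next_lba += nblocks
        return start, nblocks


@dataclass
class DualPathRouter:
    """Routes a KV tensor to the page-cache path or the NVMe-direct path.

    Assumed policy: use the page cache only while doing so keeps at least
    `reserve_bytes` of headroom; otherwise place the tensor in a contiguous
    LBA range on the NVMe-direct path.
    """
    free_page_cache_bytes: int
    reserve_bytes: int = 0
    allocator: LbaAllocator = field(default_factory=LbaAllocator)
    placements: Dict[str, Tuple[str, Optional[Tuple[int, int]]]] = field(
        default_factory=dict
    )

    def route(self, tensor_id: str, nbytes: int) -> str:
        if self.free_page_cache_bytes - nbytes >= self.reserve_bytes:
            # Enough headroom: keep the tensor on the page-cache path.
            self.free_page_cache_bytes -= nbytes
            self.placements[tensor_id] = ("page_cache", None)
            return "page_cache"
        # Memory pressure: map the tensor to a contiguous LBA range.
        lba_range = self.allocator.allocate(nbytes)
        self.placements[tensor_id] = ("nvme_direct", lba_range)
        return "nvme_direct"
```

With `DualPathRouter(free_page_cache_bytes=10_000, reserve_bytes=2_000)`, a first 6,000-byte tensor stays on the page-cache path, while a second one of the same size is sent to the NVMe-direct path and assigned two blocks starting at LBA 0. A real system would presumably sample memory availability from the OS at runtime rather than keep a local counter.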
Problem

Research questions and friction points this paper is trying to address.

KV-cache
edge LLM inference
memory budget
NVMe offloading
cache thrashing
Innovation

Methods, ideas, or system contributions that make the work stand out.

KV-cache offloading
NVMe-direct
dual-path framework
edge LLM inference
adaptive pipeline parallelism