ShadowServe: Interference-Free KV Cache Fetching for Distributed Prefix Caching

📅 2025-09-20
🤖 AI Summary
To address the dual bottlenecks of bandwidth-constrained KV retrieval and decompression-induced computational interference in distributed prefix caching under low-bandwidth conditions, this paper proposes the first SmartNIC-based hardware-offloading solution for interference-free KV cache acceleration. The approach decouples the control and data planes, fully offloading KV transmission, compression/decompression, and prefix reuse to the SmartNIC. It further designs a block-wise pipelined execution model and a minimal-copy memory mechanism to overcome SmartNIC resource constraints. Experimental evaluation under ≤ 20 Gbps network bandwidth demonstrates that, compared to state-of-the-art methods, the solution reduces Time Per Output Token (TPOT) by 2.2×, shortens Time To First Token (TTFT) by 1.38×, and improves throughput by 1.35×, significantly enhancing inference efficiency for long-context LLM serving.
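The block-wise pipelined execution model described above can be pictured as a bounded producer-consumer pipeline, where fetching a KV block and decompressing the previous one overlap instead of running back-to-back. The sketch below is illustrative only: the stage functions and the doubling "decompression" are stand-ins, not the paper's implementation.

```python
import queue
import threading

def fetch_chunks(chunks, out_q):
    # Stage 1: hand each fetched chunk downstream as soon as it arrives.
    for chunk in chunks:
        out_q.put(chunk)
    out_q.put(None)  # sentinel: no more chunks

def decompress_chunks(in_q, results):
    # Stage 2: consume chunks as they become available, overlapping with stage 1.
    while True:
        chunk = in_q.get()
        if chunk is None:
            break
        results.append([v * 2 for v in chunk])  # stand-in for real decompression

def pipeline(chunks):
    # A bounded queue caps in-flight data, modeling the SmartNIC's tight memory budget.
    q = queue.Queue(maxsize=4)
    results = []
    t1 = threading.Thread(target=fetch_chunks, args=(chunks, q))
    t2 = threading.Thread(target=decompress_chunks, args=(q, results))
    t1.start(); t2.start()
    t1.join(); t2.join()
    return results
```

With real network I/O and decompression kernels, the two stages would spend their time in different resources (NIC bandwidth vs. SmartNIC cores), which is what makes the overlap pay off.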


๐Ÿ“ Abstract
Distributed prefix caching accelerates long-context LLM serving by reusing KV cache entries for common context prefixes. However, KV cache fetches can become a bottleneck when network bandwidth is limited. Compression mitigates the bandwidth issue, but can degrade overall performance when decompression interferes with model computation. We present ShadowServe, the first SmartNIC-accelerated, interference-free prefix caching system for LLM serving. ShadowServe separates a control plane on the host and a data plane fully offloaded to the SmartNIC, which eliminates interference to both host GPU and CPU. To overcome the SmartNIC's limited compute and memory resources, we design a chunked pipeline that parallelizes data plane operations across the SmartNIC's compute resources, and a minimal-copy memory management scheme that reduces memory pressure on the SmartNIC. Compared to state-of-the-art solutions, ShadowServe achieves up to 2.2x lower loaded time-per-output-token (TPOT), and reduces time-to-first-token (TTFT) by up to 1.38x in low-bandwidth scenarios (<= 20 Gbps), translating to up to 1.35x higher throughput.
Problem

Research questions and friction points this paper is trying to address.

KV cache fetching becomes a bottleneck under limited network bandwidth
Decompression interference degrades performance during model computation
SmartNIC resource constraints challenge efficient KV cache management
Innovation

Methods, ideas, or system contributions that make the work stand out.

SmartNIC-accelerated data plane for interference-free operation
Chunked pipeline design to parallelize data plane operations
Minimal-copy memory management to reduce SmartNIC memory pressure
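A minimal-copy scheme along these lines might preallocate a small pool of reusable buffers and have decompression write directly into them, avoiding fresh allocations and intermediate copies on the memory-constrained SmartNIC. This is a hypothetical sketch of the general pattern, with all names invented here, not ShadowServe's actual design.

```python
class BufferPool:
    """A fixed pool of preallocated, reusable buffers (hypothetical sketch)."""

    def __init__(self, num_buffers, size):
        # Allocate everything up front; steady-state operation never allocates.
        self._free = [memoryview(bytearray(size)) for _ in range(num_buffers)]

    def acquire(self):
        return self._free.pop()  # raises IndexError if the pool is exhausted

    def release(self, buf):
        self._free.append(buf)

def decompress_into(src, buf):
    # Stand-in for decompression: the output lands directly in `buf`,
    # so no intermediate buffer or extra copy is needed.
    n = len(src)
    buf[:n] = src
    return n
```

Bounding the pool size also bounds peak memory use, which matters more on a SmartNIC than on a host with abundant DRAM.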