AI Summary
To address the dual bottlenecks of bandwidth-constrained KV cache retrieval and decompression-induced computational interference in distributed prefix caching under low-bandwidth conditions, this paper proposes the first SmartNIC-based hardware-offloading solution for interference-free KV cache acceleration. The approach decouples the control and data planes, fully offloading KV cache transmission, compression/decompression, and prefix reuse to the SmartNIC. A block-wise pipelined execution model and a minimal-copy memory mechanism are further designed to overcome the SmartNIC's resource constraints. Experimental evaluation under low network bandwidth (≤20 Gbps) shows that, compared to state-of-the-art methods, the solution reduces time-per-output-token (TPOT) by up to 2.2×, shortens time-to-first-token (TTFT) by up to 1.38×, and improves throughput by up to 1.35×, significantly enhancing inference efficiency for long-context LLM serving.
Abstract
Distributed prefix caching accelerates long-context LLM serving by reusing KV cache entries for common context prefixes. However, KV cache fetches can become a bottleneck when network bandwidth is limited. Compression mitigates the bandwidth issue, but can degrade overall performance when decompression interferes with model computation.
We present ShadowServe, the first SmartNIC-accelerated, interference-free prefix caching system for LLM serving. ShadowServe separates a control plane on the host from a data plane fully offloaded to the SmartNIC, which eliminates interference with both the host GPU and CPU. To overcome the SmartNIC's limited compute and memory resources, we design a chunked pipeline that parallelizes data plane operations across the SmartNIC's compute resources, and a minimal-copy memory management scheme that reduces memory pressure on the SmartNIC. Compared to state-of-the-art solutions, ShadowServe achieves up to 2.2× lower loaded time-per-output-token (TPOT) and reduces time-to-first-token (TTFT) by up to 1.38× in low-bandwidth scenarios (≤20 Gbps), translating to up to 1.35× higher throughput.
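The chunked-pipeline idea underlying the data plane can be illustrated with a small sketch: split the KV cache into independently compressed blocks, then decompress them in parallel so decompression overlaps across workers rather than running serially on the host CPU. This is a conceptual sketch only, not ShadowServe's implementation; `zlib` stands in for the actual codec, and a thread pool stands in for the SmartNIC's compute cores. All function names here are hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def compress_blocks(blocks):
    # Sender side: compress each KV block independently, so the receiver
    # can start decompressing a block before the full cache has arrived.
    return [zlib.compress(b) for b in blocks]

def pipelined_decompress(compressed_blocks, workers=4):
    # Receiver side (conceptually, the SmartNIC): decompress blocks in
    # parallel across workers instead of serially on the host CPU.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(zlib.decompress, compressed_blocks))

# Stand-in KV cache: 8 blocks of 4 KiB each.
blocks = [bytes([i]) * 4096 for i in range(8)]
wire = compress_blocks(blocks)
restored = pipelined_decompress(wire)
assert restored == blocks
```

Per-block compression trades a little compression ratio for the ability to pipeline: transfer, decompression, and consumption of block *i* can overlap with the transfer of block *i+1*, which is what hides decompression latency under low bandwidth.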