AI Summary
To address the dual bottlenecks of bandwidth-constrained KV cache retrieval and decompression-induced computational interference in distributed prefix caching under low-bandwidth conditions, this paper proposes the first SmartNIC-based hardware-offloading solution for interference-free KV cache acceleration. The approach decouples the control and data planes, fully offloading KV cache transmission, compression/decompression, and prefix reuse to the SmartNIC. A block-wise pipelined execution model and a minimal-copy memory mechanism are further designed to overcome the SmartNIC's resource constraints. Experimental evaluation under low network bandwidth (≤20 Gbps) shows that, compared to state-of-the-art methods, the solution reduces time-per-output-token (TPOT) by up to 2.2×, shortens time-to-first-token (TTFT) by up to 1.38×, and improves throughput by up to 1.35×, significantly enhancing inference efficiency for long-context LLM serving.
Abstract
Distributed prefix caching accelerates long-context LLM serving by reusing KV cache entries for common context prefixes. However, KV cache fetches can become a bottleneck when network bandwidth is limited. Compression mitigates the bandwidth issue, but can degrade overall performance when decompression interferes with model computation.
We present ShadowServe, the first SmartNIC-accelerated, interference-free prefix caching system for LLM serving. ShadowServe separates a control plane on the host from a data plane fully offloaded to the SmartNIC, which eliminates interference with both the host GPU and CPU. To overcome the SmartNIC's limited compute and memory resources, we design a chunked pipeline that parallelizes data plane operations across the SmartNIC's compute resources, and a minimal-copy memory management scheme that reduces memory pressure on the SmartNIC. Compared to state-of-the-art solutions, ShadowServe achieves up to 2.2× lower loaded time-per-output-token (TPOT) and reduces time-to-first-token (TTFT) by up to 1.38× in low-bandwidth scenarios (≤20 Gbps), translating to up to 1.35× higher throughput.
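The chunked-pipeline idea underlying the data plane can be illustrated with a small sketch: split the KV cache into independently compressed blocks, then decompress them in parallel so decompression overlaps across workers rather than running serially on the host CPU. This is a conceptual sketch only, not ShadowServe's implementation; `zlib` stands in for the actual codec, and a thread pool stands in for the SmartNIC's compute cores. All function names here are hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def compress_blocks(blocks):
    # Sender side: compress each KV block independently, so the receiver
    # can start decompressing a block before the full cache has arrived.
    return [zlib.compress(b) for b in blocks]

def pipelined_decompress(compressed_blocks, workers=4):
    # Receiver side (conceptually, the SmartNIC): decompress blocks in
    # parallel across workers instead of serially on the host CPU.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(zlib.decompress, compressed_blocks))

# Stand-in KV cache: 8 blocks of 4 KiB each.
blocks = [bytes([i]) * 4096 for i in range(8)]
wire = compress_blocks(blocks)
restored = pipelined_decompress(wire)
assert restored == blocks
```

Per-block compression trades a little compression ratio for the ability to pipeline: transfer, decompression, and consumption of block *i* can overlap with the transfer of block *i+1*, which is what hides decompression latency under low bandwidth.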