DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference

📅 2026-02-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses a critical bottleneck in multi-turn LLM agent inference under disaggregated architectures: KV-Cache storage I/O saturates the network bandwidth of the prefill engine while the decode engine sits underutilized, severely limiting system throughput. To overcome this, the authors propose a dual-path KV-Cache loading mechanism that adds a novel direct path from storage to the decode engine alongside the conventional storage-to-prefill path. Leveraging RDMA for efficient cache transfer and a global scheduler for dynamic load balancing, this approach is the first to deliver KV-Cache directly to the decode engine, alleviating storage network congestion and significantly improving resource utilization. Experiments on real-world agent workloads demonstrate up to 1.87× higher offline inference throughput and an average 1.96× increase in online serving throughput, all while strictly meeting service-level objectives (SLOs).

📝 Abstract
The performance of multi-turn, agentic LLM inference is increasingly dominated by KV-Cache storage I/O rather than computation. In prevalent disaggregated architectures, loading the massive KV-Cache from external storage creates a fundamental imbalance: storage NICs on prefill engines become bandwidth-saturated, while those on decoding engines remain idle. This asymmetry severely constrains overall system throughput. We present DualPath, an inference system that breaks this bottleneck by introducing dual-path KV-Cache loading. Beyond the traditional storage-to-prefill path, DualPath enables a novel storage-to-decode path, in which the KV-Cache is loaded into decoding engines and then efficiently transferred to prefill engines via RDMA over the compute network. DualPath combines this optimized data path -- which inherently avoids network congestion and avoids interference with latency-critical model execution communications -- with a global scheduler that dynamically balances load across prefill and decode engines. Our evaluation on three models with production agentic workloads demonstrates that DualPath improves offline inference throughput by up to 1.87$\times$ on our in-house inference system. It can also improve online serving throughput by an average factor of 1.96$\times$ without violating SLO.
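The abstract describes a global scheduler that routes each KV-Cache load over one of two paths depending on engine load. The following is a minimal illustrative sketch of such a path-selection heuristic, not the paper's actual algorithm; the names (`EngineStats`, `choose_path`) and the utilization threshold are assumptions for illustration.

```python
# Hypothetical sketch of DualPath-style path selection (illustrative only).
# The real scheduler's policy is not specified here; this just shows the idea:
# route a KV-Cache load to whichever engine's storage NIC has headroom.
from dataclasses import dataclass

@dataclass
class EngineStats:
    storage_nic_util: float  # fraction of storage-NIC bandwidth in use (0.0-1.0)

def choose_path(prefill: EngineStats, decode: EngineStats,
                threshold: float = 0.8) -> str:
    """Pick a loading path for one request's KV-Cache.

    "storage->prefill" is the conventional path. "storage->decode" loads the
    cache into the decode engine first, then forwards it to the prefill
    engine via RDMA over the compute network (the paper's novel path).
    """
    if prefill.storage_nic_util < threshold:
        return "storage->prefill"          # prefill NIC has headroom
    if decode.storage_nic_util < prefill.storage_nic_util:
        return "storage->decode"           # offload to the idler decode NIC
    return "storage->prefill"              # both saturated: default path

# Example: prefill NIC near saturation, decode NIC idle -> take the new path
print(choose_path(EngineStats(0.95), EngineStats(0.10)))  # storage->decode
```

The key observation the sketch encodes is the asymmetry from the abstract: under agentic workloads the prefill engine's storage NIC saturates while the decode engine's stays idle, so diverting loads to the decode side recovers otherwise-wasted bandwidth.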
Problem

Research questions and friction points this paper is trying to address.

KV-Cache
storage bandwidth bottleneck
agentic LLM inference
disaggregated architecture
inference throughput
Innovation

Methods, ideas, or system contributions that make the work stand out.

KV-Cache
DualPath
RDMA
disaggregated architecture
LLM inference
Yongtong Wu
School of Computer Science, Peking University
Shaoyuan Chen
Tsinghua University
Yinmin Zhong
Peking University
Machine Learning System, Distributed System
Rilin Huang
School of Computer Science, Peking University
Yixuan Tan
DeepSeek-AI
Wentao Zhang
Institute of Physics, Chinese Academy of Sciences
photoemission, superconductivity, cuprate, htsc, time-resolved
Liyue Zhang
DeepSeek-AI
Shangyan Zhou
DeepSeek-AI
Yuxuan Liu
DeepSeek-AI
Shunfeng Zhou
DeepSeek-AI
Mingxing Zhang
Tsinghua University
Xin Jin
Peking University
Computer Networks, Computer Systems, Cloud Computing
Panpan Huang
DeepSeek-AI