🤖 AI Summary
Network bandwidth is growing faster than CPU speed, which makes traditional CPU-centric network stacks a performance bottleneck. To address this, the paper proposes FlexiNS, a SmartNIC-centric programmable network stack. The design introduces: (1) a header-only offloading transmit (TX) path; (2) an in-cache receive (RX) path that supports an unlimited working set; (3) a DMA-only notification pipe; and (4) a programmable offloading engine for software-defined transport. The prototype runs on the NVIDIA BlueField-3 SmartNIC and stays compatible with the RDMA IB verbs (IBV) interface. In the evaluation, FlexiNS achieves 2.2× higher throughput than the microkernel-based baseline in block storage disaggregation and 1.3× higher throughput than the hardware-offloaded baseline in KVCache transfer, while processing packets at line rate. The stack thereby combines software transport programmability and high flexibility with line-rate performance.
📝 Abstract
As the gap between network and CPU speeds rapidly widens, the CPU-centric network stack proves inadequate due to excessive CPU and memory overhead. While hardware-offloaded network stacks alleviate these issues, they suffer from limited flexibility in both the control and data planes. Offloading the network stack to an off-path SmartNIC seems a promising way to provide high flexibility; however, throughput remains constrained by inherent SmartNIC architectural limitations. To this end, we design FlexiNS, a SmartNIC-centric network stack with software transport programmability and line-rate packet processing capabilities. To overcome these SmartNIC-induced limitations, FlexiNS introduces: (a) a header-only offloading TX path; (b) an unlimited-working-set in-cache processing RX path; (c) a high-performance DMA-only notification pipe; and (d) a programmable offloading engine. We prototype FlexiNS on the NVIDIA BlueField-3 SmartNIC and provide out-of-the-box compatibility with RDMA IB verbs (IBV) to users. FlexiNS achieves 2.2$\times$ higher throughput than the microkernel-based baseline in block storage disaggregation and 1.3$\times$ higher throughput than the hardware-offloaded baseline in KVCache transfer.