🤖 AI Summary
In cross-datacenter RDMA-based AI training, millisecond-scale propagation latency can severely degrade the efficiency of conventional reliability mechanisms such as Selective Repeat (SR), while alternatives like Erasure Coding (EC) lack sufficient hardware support. This paper proposes SDR-RDMA, a lightweight software-defined reliability stack for RDMA. Its SDR SDK extends point-to-point RDMA semantics with a receive buffer bitmap, which enables partial message completion while preserving zero-copy operation. The SDR backend is offloaded to NVIDIA's Data Path Accelerator (DPA), which lets applications run different reliability strategies, including SR and EC, at line rate without modifying NIC firmware or hardware. Evaluated under high packet-loss, long-haul links, SDR significantly improves throughput stability while maintaining full software deployability. By decoupling reliability logic from fixed hardware semantics, SDR offers a network-stack design for AI training that combines robustness, performance, and deployment flexibility.
📝 Abstract
RDMA is vital for efficient distributed training across datacenters, but millisecond-scale latencies complicate the design of its reliability layer. We show that depending on long-haul link characteristics, such as drop rate, distance, and bandwidth, the widely used Selective Repeat algorithm can be inefficient, warranting alternatives like Erasure Coding. To enable such alternatives on existing hardware, we propose SDR-RDMA, a software-defined reliability stack for RDMA. Its core is a lightweight SDR SDK that extends standard point-to-point RDMA semantics -- fundamental to AI networking stacks -- with a receive buffer bitmap. The SDR bitmap enables partial message completion, letting applications implement custom reliability schemes tailored to specific deployments while preserving zero-copy RDMA benefits. By offloading the SDR backend to NVIDIA's Data Path Accelerator (DPA), we achieve line-rate performance, enabling efficient inter-datacenter communication and advancing reliability innovation for intra-datacenter training.
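To make the receive buffer bitmap idea concrete, the following is a minimal sketch of how an application might track partial message completion and derive a Selective Repeat retransmission request from it. It is not the SDR SDK API: the `ReceiveBitmap` class, the `selective_repeat_round` helper, and the notion of fixed-size chunks are all assumptions introduced purely for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class ReceiveBitmap:
    """Hypothetical per-message bitmap: one bit per fixed-size chunk."""

    num_chunks: int
    bits: int = 0  # bit i set => chunk i has landed in the receive buffer

    def mark_received(self, chunk: int) -> None:
        """Record that a chunk arrived; duplicates are harmless."""
        if not 0 <= chunk < self.num_chunks:
            raise IndexError(f"chunk {chunk} out of range")
        self.bits |= 1 << chunk

    def missing(self) -> List[int]:
        """Return the indices of chunks that have not arrived yet."""
        return [i for i in range(self.num_chunks) if not (self.bits >> i) & 1]

    def is_complete(self) -> bool:
        """True once every chunk of the message has arrived."""
        return self.bits == (1 << self.num_chunks) - 1


def selective_repeat_round(bitmap: ReceiveBitmap, arrived: Iterable[int]) -> List[int]:
    """Apply one round of arrivals and return the chunks to re-request.

    This models an application-level Selective Repeat policy layered on
    top of the bitmap (an assumption, not the paper's implementation).
    """
    for chunk in arrived:
        bitmap.mark_received(chunk)
    return bitmap.missing()


if __name__ == "__main__":
    bm = ReceiveBitmap(num_chunks=8)
    # First transmission: chunks 3 and 6 are dropped on the long-haul link.
    to_retransmit = selective_repeat_round(bm, [0, 1, 2, 4, 5, 7])
    print("retransmit:", to_retransmit)
    # Retransmission round delivers the missing chunks.
    selective_repeat_round(bm, to_retransmit)
    print("complete:", bm.is_complete())
```

Because the application can see which chunks have arrived, it could just as well hand the partially filled buffer to an erasure-code decoder instead of re-requesting chunks. That choice of policy on top of the same bitmap is the flexibility the abstract describes.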