SDR-RDMA: Software-Defined Reliability Architecture for Planetary Scale RDMA Communication

📅 2025-05-08

🤖 AI Summary

In cross-datacenter AI training over RDMA, millisecond-scale propagation latency can make conventional reliability mechanisms such as Selective Repeat (SR) inefficient, depending on the drop rate, distance, and bandwidth of the long-haul link. Alternatives such as Erasure Coding (EC) may fit better but lack sufficient hardware support. This paper proposes SDR-RDMA, a software-defined reliability stack for RDMA. Its lightweight SDK extends standard point-to-point RDMA semantics with a receive buffer bitmap, which enables partial message completion while preserving zero-copy operation, so applications can implement reliability schemes such as SR or EC tailored to a specific deployment. The SDR backend is offloaded to NVIDIA's Data Path Accelerator (DPA) to achieve line-rate performance on existing hardware. By decoupling reliability logic from fixed hardware semantics, SDR-RDMA targets efficient inter-datacenter communication and leaves room for reliability innovation in intra-datacenter training.

📝 Abstract

RDMA is vital for efficient distributed training across datacenters, but millisecond-scale latencies complicate the design of its reliability layer. We show that depending on long-haul link characteristics, such as drop rate, distance and bandwidth, the widely used Selective Repeat algorithm can be inefficient, warranting alternatives like Erasure Coding. To enable such alternatives on existing hardware, we propose SDR-RDMA, a software-defined reliability stack for RDMA. Its core is a lightweight SDR SDK that extends standard point-to-point RDMA semantics -- fundamental to AI networking stacks -- with a receive buffer bitmap. SDR bitmap enables partial message completion to let applications implement custom reliability schemes tailored to specific deployments, while preserving zero-copy RDMA benefits. By offloading the SDR backend to NVIDIA's Data Path Accelerator (DPA), we achieve line-rate performance, enabling efficient inter-datacenter communication and advancing reliability innovation for intra-datacenter training.

Problem

Research questions and friction points this paper is trying to address.

Address inefficiency of Selective Repeat in RDMA over long-haul links

Enable custom reliability schemes without hardware changes

Achieve line-rate performance for inter-datacenter RDMA communication
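To make the first point concrete, here is a rough back-of-the-envelope sketch of why retransmission-based recovery gets expensive when the round trip takes milliseconds. It is a toy model, not the paper's analysis: it assumes independent packet drops with probability drop_prob, and an idealized Selective Repeat in which every round resends all packets still missing and each extra round costs one full RTT. The function names and parameters are hypothetical.

```python
def expected_sr_rounds(num_packets: int, drop_prob: float, tol: float = 1e-12) -> float:
    """Expected number of transmission rounds until every packet has arrived.

    Toy model (assumption): drops are independent, and each round resends
    exactly the packets that are still missing. A packet is still missing
    after k rounds with probability p^k, so
        E[rounds] = sum over k >= 0 of (1 - (1 - p^k)^N).
    """
    if num_packets < 1:
        raise ValueError("num_packets must be at least 1")
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    total = 0.0
    k = 0
    while True:
        # Probability that at least one packet is still missing after k rounds.
        still_missing = 1.0 - (1.0 - drop_prob ** k) ** num_packets
        if still_missing < tol:
            break
        total += still_missing
        k += 1
    return total


def sr_retransmit_delay_ms(num_packets: int, drop_prob: float, rtt_ms: float) -> float:
    """Extra delay from retransmission rounds, charging one RTT per extra round."""
    return (expected_sr_rounds(num_packets, drop_prob) - 1.0) * rtt_ms


if __name__ == "__main__":
    # Illustrative inputs only; they are not measurements from the paper.
    for rtt_ms in (0.01, 10.0):
        delay = sr_retransmit_delay_ms(num_packets=10_000, drop_prob=1e-4, rtt_ms=rtt_ms)
        print(f"RTT {rtt_ms} ms -> expected extra delay {delay:.4f} ms")
```

In this model the expected number of rounds depends only on the message size and drop rate, while the cost of each round scales with the RTT. That is one way to see why a scheme that is acceptable inside a datacenter can become a bottleneck across datacenters, and why forward-recovery schemes such as Erasure Coding become attractive there.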

Innovation

Methods, ideas, or system contributions that make the work stand out.

SDR-RDMA extends RDMA with a receive buffer bitmap

Enables custom reliability schemes via partial completion

Offloads SDR backend to DPA for line-rate performance
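The bitmap idea in the points above can be sketched in a few lines. This is a minimal illustration only, not the SDR SDK API: the class name, method names, and the choice of fixed-size chunks as the bitmap granularity are all assumptions made for the example. It shows how a per-chunk bitmap lets the receiver observe a partially completed message and decide what to do about the gaps.

```python
class BitmapReceiveBuffer:
    """Hypothetical receive buffer that tracks arrival per fixed-size chunk."""

    def __init__(self, message_size: int, chunk_size: int) -> None:
        if message_size < 1 or chunk_size < 1:
            raise ValueError("sizes must be positive")
        self.message_size = message_size
        self.chunk_size = chunk_size
        self.num_chunks = -(-message_size // chunk_size)  # ceiling division
        self.data = bytearray(message_size)
        self.bitmap = [False] * self.num_chunks

    def on_chunk(self, index: int, payload: bytes) -> None:
        """Write one chunk directly at its final offset and mark it received."""
        if not 0 <= index < self.num_chunks:
            raise IndexError("chunk index out of range")
        offset = index * self.chunk_size
        expected_len = min(self.chunk_size, self.message_size - offset)
        if len(payload) != expected_len:
            raise ValueError("payload length does not match chunk size")
        # Same-length slice assignment writes in place, with no staging copy.
        self.data[offset:offset + expected_len] = payload
        self.bitmap[index] = True

    def missing(self) -> list[int]:
        """Indices of chunks that have not arrived yet (partial completion view)."""
        return [i for i, received in enumerate(self.bitmap) if not received]

    def is_complete(self) -> bool:
        return all(self.bitmap)


if __name__ == "__main__":
    buf = BitmapReceiveBuffer(message_size=10, chunk_size=4)
    buf.on_chunk(0, b"abcd")
    buf.on_chunk(2, b"ij")        # the last chunk is shorter than chunk_size
    print(buf.missing())          # chunk 1 has not arrived yet
    buf.on_chunk(1, b"efgh")      # e.g. delivered by a retransmission
    print(buf.is_complete(), bytes(buf.data))
```

A Selective Repeat style scheme built on such a buffer would request retransmission of the indices returned by missing(), while an Erasure Coding style scheme could instead reconstruct the missing chunks from parity chunks once enough of the message has arrived. Either policy lives in application code on top of the same bitmap.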
