SDR-RDMA: Software-Defined Reliability Architecture for Planetary Scale RDMA Communication

📅 2025-05-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
In cross-datacenter RDMA-based AI training, millisecond-scale propagation latency severely degrades the efficiency of conventional reliability mechanisms such as Selective Repeat (SR), while alternatives like Erasure Coding (EC) lack hardware support. This paper proposes SDR, a lightweight software-defined reliability architecture. SDR extends RDMA semantics with a receiver-side bitmap-enabled buffer, enabling partial message completion and zero-copy operation. It further pioneers offloading reliability processing to NVIDIA's Data Path Accelerator (DPA), supporting dynamic, line-rate switching among diverse strategies, including SR and EC, without modifying NIC firmware or hardware. Evaluated on long-haul links with high packet loss, SDR significantly improves throughput stability while remaining fully deployable in software. By decoupling reliability logic from fixed hardware semantics, SDR establishes a new network-stack paradigm for AI training that jointly achieves robustness, performance, and deployment flexibility.

📝 Abstract
RDMA is vital for efficient distributed training across datacenters, but millisecond-scale latencies complicate the design of its reliability layer. We show that depending on long-haul link characteristics, such as drop rate, distance and bandwidth, the widely used Selective Repeat algorithm can be inefficient, warranting alternatives like Erasure Coding. To enable such alternatives on existing hardware, we propose SDR-RDMA, a software-defined reliability stack for RDMA. Its core is a lightweight SDR SDK that extends standard point-to-point RDMA semantics -- fundamental to AI networking stacks -- with a receive buffer bitmap. SDR bitmap enables partial message completion to let applications implement custom reliability schemes tailored to specific deployments, while preserving zero-copy RDMA benefits. By offloading the SDR backend to NVIDIA's Data Path Accelerator (DPA), we achieve line-rate performance, enabling efficient inter-datacenter communication and advancing reliability innovation for intra-datacenter training.
Problem

Research questions and friction points this paper is trying to address.

Address inefficiency of Selective Repeat in RDMA over long-haul links
Enable custom reliability schemes without hardware changes
Achieve line-rate performance for inter-datacenter RDMA communication
Innovation

Methods, ideas, or system contributions that make the work stand out.

SDR-RDMA extends RDMA with receive buffer bitmap
Enables custom reliability schemes via partial completion
Offloads SDR backend to DPA for line-rate performance
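The receive-buffer bitmap at the core of SDR can be sketched as follows. This is an illustrative model only, not SDR-RDMA's actual SDK: the `ReceiveBitmap` class, its method names, and the fixed chunk granularity are all assumptions made for the sketch. It shows how a receiver-side bitmap lets the application observe partial message completion and compute the missing chunks that a Selective Repeat sender would retransmit (or that an Erasure Coding scheme would repair).

```python
# Illustrative sketch of a receiver-side bitmap for partial message
# completion. Names and granularity are hypothetical, not the SDR SDK API.

class ReceiveBitmap:
    def __init__(self, msg_size: int, chunk_size: int):
        self.msg_size = msg_size
        self.chunk_size = chunk_size
        # Number of fixed-size chunks covering the message (ceiling division).
        self.num_chunks = -(-msg_size // chunk_size)
        self.bits = 0  # bit i set => chunk i has landed in the buffer

    def mark_received(self, offset: int) -> None:
        """Record arrival of the chunk covering byte `offset`."""
        self.bits |= 1 << (offset // self.chunk_size)

    def is_complete(self) -> bool:
        """True once every chunk of the message has been received."""
        return self.bits == (1 << self.num_chunks) - 1

    def missing_chunks(self) -> list[int]:
        """Chunk indices still outstanding; under SR these would be
        retransmitted, under EC they could be reconstructed from parity."""
        return [i for i in range(self.num_chunks)
                if not (self.bits >> i) & 1]


# Example: a 4 KiB message split into four 1 KiB chunks, arriving out
# of order over a lossy long-haul link.
bm = ReceiveBitmap(msg_size=4096, chunk_size=1024)
bm.mark_received(0)      # chunk 0 arrives
bm.mark_received(2048)   # chunk 2 arrives; chunks 1 and 3 still missing
```

Because the bitmap exposes exactly which regions of the registered buffer are valid, the application can consume completed chunks immediately (preserving zero-copy) and choose a recovery strategy per deployment rather than relying on the NIC's fixed go-back or retransmit semantics.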
Mikhail Khalilov
ETH Zurich, Zurich, Switzerland
Siyuan Shen
School of Information Science and Technology, ShanghaiTech University
Marcin Chrapek
ETH Zurich, Zurich, Switzerland
Tiancheng Chen
ETH Zurich, Zurich, Switzerland
Kenji Nakano
ETH Zurich, Zurich, Switzerland
Peter-Jan Gootzen
NVIDIA, Santa Clara, United States of America
S. D. Girolamo
NVIDIA, Santa Clara, United States of America
Rami Nudelman
NVIDIA, Santa Clara, United States of America
Gil Bloch
NVIDIA, Santa Clara, United States of America
S. Anantharamu
Microsoft Corporation, Redmond, United States of America
Mahmoud Elhaddad
Microsoft Corporation, Redmond, United States of America
Jithin Jose
Microsoft Corporation, Redmond, United States of America
Abdul Kabbani
Principal Architect, Microsoft and Adjunct Associate Professor, University of California
Scott Moe
Microsoft Corporation, Redmond, United States of America
Konstantin Taranov
ETH Zurich
Zhuolong Yu
Microsoft
Jie Zhang
Microsoft Corporation, Redmond, United States of America
Nicola Mazzoletti
Swiss National Supercomputing Centre (CSCS), Lugano, Switzerland
Torsten Hoefler
Professor of Computer Science at ETH Zurich