🤖 AI Summary
In large-scale distributed machine learning (ML), collective communication suffers from high tail latency, while existing RDMA protocols (e.g., RoCE) scale poorly due to their stringent reliability and in-order delivery guarantees. Method: This paper proposes a lightweight RDMA architecture tailored to ML workloads, intentionally relaxing strict reliability and ordering requirements to exploit ML's inherent tolerance of partial packet loss. Retransmission and error correction are offloaded to the upper-layer training pipeline; the design employs best-effort transmission, adaptive timeouts, and priority-based data scheduling, retains DCQCN for congestion control, and integrates the Hadamard transform for efficient fault tolerance. Contribution/Results: Experiments show a 2.3× reduction in 99th-percentile latency, a 67% reduction in BRAM resource usage, and a nearly 2× improvement in NIC fault recovery—significantly enhancing the scalability and resilience of ML communication in ten-thousand-GPU clusters.
📝 Abstract
As distributed machine learning (ML) workloads scale to thousands of GPUs connected by ultra-high-speed interconnects, tail latency in collective communication has emerged as a primary bottleneck. Prior RDMA designs, such as RoCE, IRN, and SRNIC, enforce strict reliability and in-order delivery, relying on retransmissions and packet sequencing to ensure correctness. While effective for general-purpose workloads, these mechanisms introduce complexity and latency that scale poorly: even rare packet losses or delays can consistently degrade system performance. We introduce Celeris, a domain-specific RDMA transport that revisits traditional reliability guarantees based on ML's tolerance for lost or partial data. Celeris removes retransmissions and in-order delivery from the RDMA NIC, enabling best-effort transport that exploits the robustness of ML workloads. It retains congestion control (e.g., DCQCN) and manages communication with software-level mechanisms such as adaptive timeouts and data prioritization, while shifting loss recovery to the ML pipeline (e.g., using the Hadamard Transform). Early results show that Celeris reduces 99th-percentile latency by up to 2.3x, cuts BRAM usage by 67%, and nearly doubles NIC resilience to faults -- delivering a resilient, scalable transport tailored for ML at cluster scale.
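The abstract's claim that a Hadamard transform enables loss recovery in the ML pipeline can be illustrated with a minimal sketch. The idea is that encoding a gradient chunk with a Hadamard matrix spreads each original value across all transmitted coefficients, so dropping one coefficient (a lost packet under best-effort transport) perturbs every entry slightly instead of erasing one entry entirely. This is an assumption-laden illustration of the general technique, not Celeris's actual encoder; the chunk size, loss model, and scaling are chosen here for demonstration.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction of an n x n Hadamard matrix; n must be a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

n = 8
H = hadamard(n)                      # satisfies H @ H.T = n * I
rng = np.random.default_rng(0)
x = rng.standard_normal(n)           # a gradient chunk to transmit (hypothetical)

y = H @ x                            # encode before sending
y_lossy = y.copy()
y_lossy[3] = 0.0                     # simulate one lost coefficient (best-effort, no retransmit)

x_rec = (H.T @ y_lossy) / n          # decode whatever arrived
err = x - x_rec                      # error is spread evenly: |err_i| = |y[3]| / n for all i
```

Because every row of `H` has entries of magnitude 1, the per-entry error after a single loss is uniformly `|y[3]|/n`, which shrinks as the chunk grows; an unencoded transmission would instead lose `x[3]` outright. This graceful degradation is what lets the training pipeline absorb loss without NIC-level retransmission.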