🤖 AI Summary
This work addresses the challenge of scaling distributed AI training to hundreds of thousands of GPUs, where conventional networks struggle to deliver high throughput, low latency, and stability under dynamic load at the same time. The authors present the Spectrum-X multiplane network architecture, which replaces hierarchical depth with topological parallelism and integrates hardware-accelerated load balancing directly into NICs and switches. This design lets the network react to changing conditions at microsecond timescales, improving bandwidth utilization and fault resilience. The reported evaluation shows 98% of theoretical line rate with low, jitter-free latency, a 7% latency increase under 10% fabric link failures, strong cross-tenant isolation for concurrent workloads, and rapid reaction to host and fabric link flaps during LLM training.
📝 Abstract
As distributed model training scales to hundreds of thousands of GPUs, scale-out networks face unprecedented performance and efficiency demands. NVIDIA Spectrum-X Ethernet has been designed from the ground up to achieve predictable and stable network performance with high utilization and low latency. This paper presents the Spectrum-X multiplane architecture, which replaces hierarchical depth with topological parallelism, and introduces hardware-accelerated load balancing in NICs and switches as the key architectural approach for reacting quickly to highly dynamic network conditions at the microsecond timescales that AI training workloads demand. We describe the motivation, design principles, evaluation methodology, and performance on state-of-the-art benchmarks, as well as the lessons we learned from deploying and debugging Spectrum-X networks in large-scale systems. Our evaluation highlights production-grade AI infrastructure performance across three core dimensions: performance, with 98% of the theoretical line rate and low, jitter-free latency; isolation, with strong cross-tenant isolation for concurrent workloads; and resilience, with robust, capacity-proportional bisection bandwidth, a 7% latency increase under 10% fabric link failures, and rapid reaction to host and fabric link flaps during LLM training workloads.
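To make the headline figures concrete, the following is a minimal arithmetic sketch, not the paper's evaluation method. It assumes that "capacity-proportional" means usable bandwidth shrinks in direct proportion to the fraction of failed links, and it uses a hypothetical 400 Gb/s line rate that does not come from the source. The function names are illustrative only.

```python
def achieved_throughput(line_rate_gbps: float, utilization: float) -> float:
    """Throughput at a given fraction of the theoretical line rate."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be in [0, 1]")
    return line_rate_gbps * utilization


def capacity_proportional_bandwidth(nominal_gbps: float, failed_fraction: float) -> float:
    """Bandwidth remaining if it degrades in proportion to lost link capacity.

    Assumption: a fabric that loses a fraction f of its links keeps (1 - f)
    of its nominal bandwidth, with no additional penalty.
    """
    if not 0.0 <= failed_fraction <= 1.0:
        raise ValueError("failed_fraction must be in [0, 1]")
    return nominal_gbps * (1.0 - failed_fraction)


# Hypothetical line rate chosen for illustration; not taken from the paper.
LINE_RATE_GBPS = 400.0

# 98% of theoretical line rate on a healthy fabric.
healthy_gbps = achieved_throughput(LINE_RATE_GBPS, 0.98)

# Ideal capacity-proportional ceiling with 10% of fabric links failed.
degraded_ceiling_gbps = capacity_proportional_bandwidth(LINE_RATE_GBPS, 0.10)

print(f"healthy throughput: {healthy_gbps:.1f} Gb/s")
print(f"ceiling with 10% links failed: {degraded_ceiling_gbps:.1f} Gb/s")
```

Under these assumptions, the degraded figure is only an upper bound implied by the remaining capacity; the abstract does not state the throughput actually measured under failures.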