🤖 AI Summary
To address the interconnect bottleneck in wafer-scale distributed DNN training—stringent bandwidth and latency requirements combined with poor adaptability to heterogeneous parallelization strategies (data, model, and pipeline parallelism)—this paper proposes FRED, a wafer-scale interconnect architecture that natively supports in-switch execution of collective communication (e.g., AllReduce, AllGather). FRED is tailored to the high-bandwidth demands of wafer-scale networks and can efficiently execute the communication patterns of different parallelization strategies, while its in-switch collective execution reduces network traffic by approximately 2X. Evaluated against a baseline wafer-scale 2D-mesh fabric, the design achieves average end-to-end training speedups of 1.76X (ResNet-152), 1.87X (Transformer-17B), 1.34X (GPT-3), and 1.40X (Transformer-1T).
📝 Abstract
Distributed Deep Neural Network (DNN) training is a technique for reducing training overhead by distributing training tasks across multiple accelerators according to a parallelization strategy. However, high-performance compute and interconnects are needed for maximum speedup and linear scaling of the system. Wafer-scale systems are a promising technology that allows tightly integrating high-end accelerators with high-speed wafer-scale interconnects, making them an attractive platform for distributed training. However, the wafer-scale interconnect should offer high performance and flexibility across various parallelization strategies to enable maximum optimization of compute and memory usage. In this paper, we propose FRED, a wafer-scale interconnect that is tailored to the high-bandwidth requirements of wafer-scale networks and can efficiently execute the communication patterns of different parallelization strategies. Furthermore, FRED supports in-switch collective communication execution, which reduces network traffic by approximately 2X. Our results show that FRED can improve the average end-to-end training time of ResNet-152, Transformer-17B, GPT-3, and Transformer-1T by 1.76X, 1.87X, 1.34X, and 1.4X, respectively, compared to a baseline wafer-scale 2D-mesh fabric.
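The roughly 2X traffic reduction from in-switch collectives can be understood with a simple injected-bytes count. A minimal sketch (the function names and the per-node accounting below are illustrative assumptions, not from the paper): a bandwidth-optimal ring AllReduce injects about 2(p-1)/p times the gradient size per node, whereas with in-switch reduction each node injects its gradients once and the fabric returns the reduced result.

```python
def ring_allreduce_bytes_per_node(grad_bytes: float, p: int) -> float:
    """Bytes injected per node by a bandwidth-optimal ring AllReduce:
    (p-1)/p for reduce-scatter plus (p-1)/p for all-gather."""
    return 2 * (p - 1) / p * grad_bytes

def in_switch_allreduce_bytes_per_node(grad_bytes: float, p: int) -> float:
    """Bytes injected per node when the switches reduce in-network:
    each node sends its gradients exactly once."""
    return grad_bytes

# Example: 1 GB of gradients across 64 accelerators on the wafer.
grad, p = 1e9, 64
ratio = ring_allreduce_bytes_per_node(grad, p) / in_switch_allreduce_bytes_per_node(grad, p)
print(f"traffic reduction: {ratio:.2f}x")  # approaches 2x as p grows
```

For p = 64 the ratio is 2 * 63/64 ≈ 1.97, consistent with the "approximately 2X" reduction stated in the abstract.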