FRED: Flexible REduction-Distribution Interconnect and Communication Implementation for Wafer-Scale Distributed Training of DNN Models

📅 2024-06-28
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the interconnect bottleneck in wafer-scale distributed DNN training, namely stringent bandwidth requirements and poor adaptability to heterogeneous parallelization strategies (data, model, and pipeline parallelism), this paper proposes FRED, a wafer-scale interconnect architecture that natively supports in-switch collective communication (e.g., AllReduce, AllGather), reducing network traffic by approximately 2X. FRED is tailored to the high-bandwidth demands of wafer-scale networks and can efficiently execute the communication patterns of different parallelization strategies. Compared to a baseline wafer-scale 2D-mesh fabric, FRED improves average end-to-end training time by 1.76X (ResNet-152), 1.87X (Transformer-17B), 1.34X (GPT-3), and 1.40X (Transformer-1T).

📝 Abstract
Distributed Deep Neural Network (DNN) training is a technique to reduce training overhead by distributing training tasks across multiple accelerators according to a parallelization strategy. However, high-performance compute and interconnects are needed for maximum speed-up and linear scaling of the system. Wafer-scale systems are a promising technology that allows tight integration of high-end accelerators with high-speed wafer-scale interconnects, making them an attractive platform for distributed training. However, the wafer-scale interconnect should offer high performance and flexibility for various parallelization strategies to enable maximum optimization of compute and memory usage. In this paper, we propose FRED, a wafer-scale interconnect that is tailored for the high-bandwidth requirements of wafer-scale networks and can efficiently execute the communication patterns of different parallelization strategies. Furthermore, FRED supports in-switch collective communication execution, which reduces network traffic by approximately 2X. Our results show that FRED can improve the average end-to-end training time of ResNet-152, Transformer-17B, GPT-3, and Transformer-1T by 1.76X, 1.87X, 1.34X, and 1.4X, respectively, when compared to a baseline wafer-scale 2D-Mesh fabric.
Problem

Research questions and friction points this paper is trying to address.

Enabling flexible interconnect for wafer-scale DNN training
Optimizing communication patterns for parallelization strategies
Reducing network traffic via in-switch collective execution
Innovation

Methods, ideas, or system contributions that make the work stand out.

Wafer-scale interconnect for high-BW networks
Supports various parallelization strategies efficiently
In-switch collective communication reduces traffic
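The ~2X traffic reduction from in-switch collectives has a simple first-order explanation: an endpoint-based ring AllReduce makes each node send roughly 2(N-1)/N of its buffer, while a switch that reduces in the network needs each node to send its buffer only once before the result is multicast back. A minimal sketch of that comparison (not from the paper; function names and the ring-AllReduce cost model are illustrative assumptions):

```python
# Hedged sketch: first-order per-node traffic of a ring AllReduce
# versus an in-switch (in-network) reduction, for n nodes and a
# gradient buffer of d bytes. Names are illustrative, not FRED's API.

def ring_allreduce_bytes_sent_per_node(n: int, d: float) -> float:
    # Ring AllReduce = reduce-scatter + all-gather; each phase sends
    # (n - 1) / n * d bytes per node, so 2 * (n - 1) / n * d in total.
    return 2 * (n - 1) / n * d

def in_switch_allreduce_bytes_sent_per_node(n: int, d: float) -> float:
    # In-switch reduction: each node injects its full buffer once;
    # the switch reduces the streams and multicasts the result back,
    # so per-node injected traffic stays at d regardless of n.
    return d

if __name__ == "__main__":
    n, d = 16, 1024 ** 2  # e.g., 16 accelerators, 1 MiB buffer
    ring = ring_allreduce_bytes_sent_per_node(n, d)
    fred_like = in_switch_allreduce_bytes_sent_per_node(n, d)
    # Ratio is 2 * (n - 1) / n, which approaches 2x as n grows.
    print(f"traffic ratio: {ring / fred_like:.3f}x")
```

For n = 16 the ratio is 1.875x, consistent with the paper's "approximately 2X" claim; the exact savings on a real fabric also depend on topology and routing, which this sketch ignores.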