AI Summary
Large-scale distributed deep learning suffers from scalability bottlenecks due to high global communication overhead and load imbalance. This paper proposes Wait-Avoiding Group Model Averaging SGD (WAGMA-SGD), which replaces global AllReduce with localized subgroup AllReduce and introduces the first wait-avoiding grouped AllReduce mechanism. Theoretically, WAGMA-SGD preserves the O(1/√T) convergence rate of AllReduce-SGD while eliminating global synchronization blocking. The method integrates asynchronous subgroup weight exchange, an enhanced stochastic gradient update protocol, and a novel distributed convergence analysis framework. Experiments demonstrate that WAGMA-SGD achieves a 2.1× throughput improvement in reinforcement learning on 1,024 GPUs; attains state-of-the-art time–accuracy trade-offs for Transformer-based machine translation; and significantly outperforms mainstream decentralized SGD methods in scaling efficiency for ResNet-50/ImageNet training.
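The O(1/√T) rate cited above is the standard first-order guarantee for SGD on nonconvex objectives; as a hedged sketch of the form such a bound typically takes (here \(f\) is the training loss and \(x_t\) the iterates; the symbols are generic placeholders, not the paper's exact statement or constants):

$$
\min_{1 \le t \le T} \; \mathbb{E}\left[\left\|\nabla f(x_t)\right\|^2\right] \;\le\; \mathcal{O}\!\left(\frac{1}{\sqrt{T}}\right)
$$

The claim is that subgroup averaging does not degrade this asymptotic rate relative to globally synchronous AllReduce-SGD.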
Abstract
Deep learning at scale is dominated by communication time. Distributing samples across nodes usually yields the best performance, but poses scaling challenges due to global information dissemination and load imbalance across uneven sample lengths. State-of-the-art decentralized optimizers mitigate the problem, but require more iterations to achieve the same accuracy as their globally-communicating counterparts. We present Wait-Avoiding Group Model Averaging (WAGMA) SGD, a wait-avoiding stochastic optimizer that reduces global communication via subgroup weight exchange. The key insight is a combination of algorithmic changes to the averaging scheme and the use of a group allreduce operation. We prove the convergence of WAGMA-SGD, and empirically show that it retains convergence rates similar to Allreduce-SGD. For evaluation, we train ResNet-50 on ImageNet; Transformer for machine translation; and deep reinforcement learning for navigation at scale. Compared with state-of-the-art decentralized SGD variants, WAGMA-SGD significantly improves training throughput (e.g., 2.1Γ on 1,024 GPUs for reinforcement learning), and achieves the fastest time-to-solution (e.g., the highest score using the shortest training time for Transformer).
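To make the subgroup-averaging idea concrete, here is a minimal single-process sketch, under stated assumptions: the real WAGMA-SGD runs a non-blocking group allreduce across GPUs, whereas this toy simulation models each worker's weights as a single float and the group allreduce as an in-memory average. The `group_average` helper and the rotation scheme below are illustrative, not the paper's implementation.

```python
def group_average(models, group_size):
    """One round of subgroup model averaging: each disjoint group of
    `group_size` workers replaces its members' models with the group mean
    (the effect of an allreduce restricted to that subgroup)."""
    averaged = list(models)
    for start in range(0, len(models), group_size):
        group = models[start:start + group_size]
        mean = sum(group) / len(group)
        for i in range(start, start + len(group)):
            averaged[i] = mean
    return averaged

# Eight workers with divergent weights; groups of four exchange
# instead of all eight, so no global synchronization point exists.
models = [float(i) for i in range(8)]
step1 = group_average(models, 4)           # [1.5, 1.5, 1.5, 1.5, 5.5, 5.5, 5.5, 5.5]

# Re-forming the groups between iterations (here: a simple rotation by
# two workers) lets information propagate globally over several rounds.
step2 = group_average(step1[2:] + step1[:2], 4)  # all workers reach 3.5
```

After just two rounds with rotated groups, every worker holds the global mean of the initial models, illustrating why grouped averaging can match the convergence behavior of a global allreduce while each round only synchronizes within a subgroup.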