AI Summary
Large-scale distributed deep learning suffers from scalability bottlenecks due to high global communication overhead and load imbalance. This paper proposes Wait-Avoiding Group Model Averaging SGD (WAGMA-SGD), which replaces global AllReduce with localized subgroup AllReduce and introduces the first wait-avoiding grouped AllReduce mechanism. Theoretically, WAGMA-SGD preserves the O(1/√T) convergence rate of AllReduce-SGD while eliminating global synchronization blocking. The method integrates asynchronous subgroup weight exchange, an enhanced stochastic gradient update protocol, and a novel distributed convergence analysis framework. Experiments demonstrate that WAGMA-SGD achieves a 2.1× throughput improvement in reinforcement learning on 1,024 GPUs; attains state-of-the-art time–accuracy trade-offs for Transformer-based machine translation; and significantly outperforms mainstream decentralized SGD methods in scaling efficiency for ResNet-50/ImageNet training.
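The O(1/√T) rate cited above is the standard first-order guarantee for SGD on nonconvex objectives; as a hedged sketch of the form such a bound typically takes (here \(f\) is the training loss and \(x_t\) the iterates; the symbols are generic placeholders, not the paper's exact statement or constants):

$$
\min_{1 \le t \le T} \; \mathbb{E}\left[\left\|\nabla f(x_t)\right\|^2\right] \;\le\; \mathcal{O}\!\left(\frac{1}{\sqrt{T}}\right)
$$

The claim is that subgroup averaging does not degrade this asymptotic rate relative to globally synchronous AllReduce-SGD.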
Abstract
Deep learning at scale is dominated by communication time. Distributing samples across nodes usually yields the best performance, but poses scaling challenges due to global information dissemination and load imbalance across uneven sample lengths. State-of-the-art decentralized optimizers mitigate the problem, but require more iterations to achieve the same accuracy as their globally-communicating counterparts. We present Wait-Avoiding Group Model Averaging (WAGMA) SGD, a wait-avoiding stochastic optimizer that reduces global communication via subgroup weight exchange. The key insight is a combination of algorithmic changes to the averaging scheme and the use of a group allreduce operation. We prove the convergence of WAGMA-SGD, and empirically show that it retains convergence rates similar to Allreduce-SGD. For evaluation, we train ResNet-50 on ImageNet; Transformer for machine translation; and deep reinforcement learning for navigation at scale. Compared with state-of-the-art decentralized SGD variants, WAGMA-SGD significantly improves training throughput (e.g., 2.1Γ on 1,024 GPUs for reinforcement learning), and achieves the fastest time-to-solution (e.g., the highest score using the shortest training time for Transformer).
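To make the subgroup-averaging idea concrete, here is a minimal single-process sketch, under stated assumptions: the real WAGMA-SGD runs a non-blocking group allreduce across GPUs, whereas this toy simulation models each worker's weights as a single float and the group allreduce as an in-memory average. The `group_average` helper and the rotation scheme below are illustrative, not the paper's implementation.

```python
def group_average(models, group_size):
    """One round of subgroup model averaging: each disjoint group of
    `group_size` workers replaces its members' models with the group mean
    (the effect of an allreduce restricted to that subgroup)."""
    averaged = list(models)
    for start in range(0, len(models), group_size):
        group = models[start:start + group_size]
        mean = sum(group) / len(group)
        for i in range(start, start + len(group)):
            averaged[i] = mean
    return averaged

# Eight workers with divergent weights; groups of four exchange
# instead of all eight, so no global synchronization point exists.
models = [float(i) for i in range(8)]
step1 = group_average(models, 4)           # [1.5, 1.5, 1.5, 1.5, 5.5, 5.5, 5.5, 5.5]

# Re-forming the groups between iterations (here: a simple rotation by
# two workers) lets information propagate globally over several rounds.
step2 = group_average(step1[2:] + step1[:2], 4)  # all workers reach 3.5
```

After just two rounds with rotated groups, every worker holds the global mean of the initial models, illustrating why grouped averaging can match the convergence behavior of a global allreduce while each round only synchronizes within a subgroup.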