Near-optimal sparse allreduce for distributed deep learning

📅 2022-01-19

🏛️ ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming

📈 Citations: 55

✨ Influential: 6

🤖 AI Summary

Communication overhead limits the scalability of large-scale distributed deep learning training. Gradient sparsification reduces communication volume, but two practical obstacles stand in the way of real speedups: the difficulty of building a scalable and efficient sparse AllReduce, and the computational cost of top-k selection. This paper proposes Ok-Topk, a scheme for distributed training with sparse gradients. First, it introduces a sparse AllReduce algorithm whose communication volume is below 6k, which is asymptotically optimal. Second, it selects the top-k gradient values according to an estimated threshold, which reduces the sparsification overhead. Third, it integrates the sparse AllReduce with the decentralized parallel SGD optimizer and proves convergence. Evaluated on the Piz Daint supercomputer, Ok-Topk reaches model accuracy similar to dense AllReduce and is more scalable than both the optimized dense and the state-of-the-art sparse AllReduce baselines, improving training throughput by 3.29x-12.95x for BERT on 256 GPUs.
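To give a feel for what a volume bound of 6k means, the sketch below compares it with the well-known per-worker volume of a bandwidth-optimal dense ring allreduce, 2n(P-1)/P. The model size `n`, the 1% density, and the choice to count both volumes in transmitted words per worker are assumptions made for illustration; they are not figures from the paper.

```python
def dense_ring_volume(n, p):
    """Words sent per worker by a bandwidth-optimal ring allreduce of n values."""
    return 2 * n * (p - 1) / p


def sparse_volume_bound(k):
    """Upper bound quoted in the summary: fewer than 6k words."""
    return 6 * k


n = 100_000_000   # hypothetical number of gradient values
p = 256           # number of workers (GPUs)
k = n // 100      # hypothetical density: keep 1% of the values

ratio = dense_ring_volume(n, p) / sparse_volume_bound(k)
print(f"dense / sparse-bound volume ratio: {ratio:.1f}")
```

Under these assumptions the dense volume is roughly 33 times the sparse bound. The throughput gain seen in practice will differ, since it also depends on latency, sparsification cost, and compute time.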

📝 Abstract

Communication overhead is one of the major obstacles to training large deep learning models at scale. Gradient sparsification is a promising technique to reduce the communication volume. However, it is very challenging to obtain real performance improvement because of (1) the difficulty of achieving a scalable and efficient sparse allreduce algorithm and (2) the sparsification overhead. This paper proposes Ok-Topk, a scheme for distributed training with sparse gradients. Ok-Topk integrates a novel sparse allreduce algorithm (less than 6k communication volume, which is asymptotically optimal) with the decentralized parallel Stochastic Gradient Descent (SGD) optimizer, and its convergence is proved. To reduce the sparsification overhead, Ok-Topk efficiently selects the top-k gradient values according to an estimated threshold. Evaluations are conducted on the Piz Daint supercomputer with neural network models from different deep learning domains. Empirical results show that Ok-Topk achieves similar model accuracy to dense allreduce. Compared with the optimized dense and the state-of-the-art sparse allreduces, Ok-Topk is more scalable and significantly improves training throughput (e.g., 3.29x-12.95x improvement for BERT on 256 GPUs).
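The abstract says top-k values are selected "according to an estimated threshold" but does not spell out how the threshold is obtained. The following is a minimal sketch of one plausible reading: compute an exact threshold occasionally and reuse it for the steps in between, so most steps need only a single linear pass. The class `ThresholdTopk` and its `refresh` parameter are hypothetical names, not the paper's API.

```python
import heapq


def exact_threshold(grad, k):
    """Return the k-th largest absolute value in grad (exact selection)."""
    return heapq.nlargest(k, (abs(g) for g in grad))[-1]


def select_by_threshold(grad, threshold):
    """Keep (index, value) pairs whose magnitude reaches the threshold."""
    return [(i, g) for i, g in enumerate(grad) if abs(g) >= threshold]


class ThresholdTopk:
    """Hypothetical helper: refresh the threshold only every `refresh` steps."""

    def __init__(self, k, refresh=4):
        self.k = k
        self.refresh = refresh
        self.step = 0
        self.threshold = None

    def select(self, grad):
        # Pay for an exact selection only on refresh steps (assumed policy).
        if self.step % self.refresh == 0:
            self.threshold = exact_threshold(grad, self.k)
        self.step += 1
        # Every step: one linear pass against the current threshold.
        return select_by_threshold(grad, self.threshold)


selector = ThresholdTopk(k=2, refresh=2)
first = selector.select([0.1, -5.0, 3.0, 0.2])   # threshold computed here
second = selector.select([4.0, 0.5, -0.1, 2.9])  # threshold reused
```

With a reused threshold the number of selected values is only approximately k: in the second call above a single value passes. This is the trade-off such a scheme makes in exchange for cheaper selection.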

Problem

Research questions and friction points this paper is trying to address.

Reducing communication overhead in distributed deep learning training

Developing scalable and efficient sparse allreduce algorithms

Minimizing sparsification overhead while maintaining model accuracy

Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse allreduce algorithm with asymptotically optimal volume

Decentralized parallel SGD optimizer integration

Efficient top-k gradient selection via threshold estimation
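For readers unfamiliar with the operation named in the first item, the sketch below shows what a sparse allreduce computes, not how Ok-Topk communicates it. Each worker contributes a sparse gradient as an index-to-value mapping and the result is the element-wise sum. The function name and the dict representation are assumptions for illustration; no communication schedule is modelled.

```python
from collections import defaultdict


def sparse_allreduce_reference(worker_grads):
    """Sum sparse gradients given as {index: value} dicts, one per worker.

    Reference semantics only: a real implementation would exchange the
    data between processes rather than loop over them in one place.
    """
    total = defaultdict(float)
    for grad in worker_grads:
        for idx, val in grad.items():
            total[idx] += val
    return dict(total)


workers = [
    {0: 1.0, 3: 2.0},    # worker 0 selected indices 0 and 3
    {3: -0.5, 7: 4.0},   # worker 1 selected indices 3 and 7
]
reduced = sparse_allreduce_reference(workers)
```

Here each worker contributes two entries but the sum has three, because the workers selected different indices. This growth of the index set during reduction is one reason an efficient sparse allreduce is harder to design than a dense one.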

🔎 Similar Papers

AutoDDL: Automatic Distributed Deep Learning With Near-Optimal Bandwidth Cost