🤖 AI Summary
Asynchronous pipeline parallelism suffers from gradient staleness, leading to slow convergence and inferior performance compared to synchronous baselines.
Method: This paper proposes PipeNAG, the first Nesterov Accelerated Gradient (NAG) variant tailored for asynchronous pipeline parallelism. PipeNAG reformulates NAG's forward-prediction (look-ahead) step with a delay-aware momentum correction that explicitly compensates for the weight–gradient asynchrony induced by communication latency.
Contribution/Results: We provide theoretical guarantees showing that PipeNAG maintains a sublinear convergence rate even under fixed gradient delays. Empirically, on a 1-billion-parameter decoder model, PipeNAG significantly outperforms existing asynchronous methods and even surpasses the synchronous baseline, while retaining the 100% pipeline utilization of asynchronous execution and delivering higher training throughput. To our knowledge, this is the first work to successfully integrate Nesterov momentum into asynchronous pipeline optimization, thereby addressing a long-standing performance bottleneck in asynchronous distributed training.
📝 Abstract
Pipeline Parallelism (PP) enables training large neural networks on small, interconnected devices by splitting the model into multiple stages. Asynchronous optimization is appealing in this setting because it achieves 100% pipeline utilization by construction. However, it is inherently challenging: the weights and gradients are no longer synchronized, leading to stale (or delayed) gradients. To alleviate this, we introduce a variant of Nesterov Accelerated Gradient (NAG) for asynchronous optimization in PP. Specifically, we modify the look-ahead step in NAG to effectively address gradient staleness. We theoretically prove that our approach converges at a sublinear rate in the presence of a fixed gradient delay. Our experiments on large-scale language modelling tasks using decoder-only architectures with up to 1B parameters demonstrate that our approach significantly outperforms existing asynchronous methods, even surpassing the synchronous baseline.
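To make the core idea concrete, the sketch below shows standard NAG with the look-ahead extended to account for a known gradient delay: instead of peeking one momentum step ahead, the gradient is evaluated `delay + 1` momentum steps ahead, predicting where the weights will be when the stale gradient is actually applied. The function name `nag_delay_aware_step`, the specific `(delay + 1)` extension, and the quadratic toy objective are all illustrative assumptions for exposition, not the paper's exact update rule.

```python
import numpy as np

def nag_delay_aware_step(theta, v, grad_fn, lr=0.1, mu=0.9, delay=0):
    """One illustrative delay-aware Nesterov update (sketch, not the paper's rule).

    theta   : current weights
    v       : momentum buffer
    grad_fn : returns the gradient at the point it is given
    delay   : number of steps the gradient lags behind the weights;
              delay=0 recovers standard NAG with a one-step look-ahead.
    """
    # Extend the usual one-step look-ahead by the delay, so the gradient
    # is taken where the weights are predicted to be when it is applied.
    lookahead = theta + (delay + 1) * mu * v
    g = grad_fn(lookahead)
    v = mu * v - lr * g          # momentum update with the corrected gradient
    theta = theta + v            # apply the step
    return theta, v

# Toy check on f(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
theta, v = np.array([1.0, -1.0]), np.zeros(2)
for _ in range(200):
    theta, v = nag_delay_aware_step(theta, v, lambda x: x, delay=2)
print(np.linalg.norm(theta))  # converges close to the optimum at the origin
```

Even with a simulated delay of 2 steps, the extended look-ahead keeps the iterates contracting toward the minimizer on this convex toy problem; with the standard one-step look-ahead, larger delays would shrink the stable learning-rate range.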