🤖 AI Summary
To address the high communication overhead and the substantial memory and compute costs of large-scale distributed training, this paper proposes a local parameter-freezing mechanism: during multi-step local updates, each node optimizes only a fixed subset of parameters and disables gradient computation for the frozen remainder, eliminating their gradient computation and transmission, while a full-parameter forward pass avoids inter-node activation exchange. The method combines phased parameter updates, local gradient computation, and full-parameter forward propagation within a standard synchronous training framework. When training a 1.3-billion-parameter language model across 32 nodes, it matches baseline perplexity under identical communication bandwidth and data budgets while reducing peak memory consumption by 27% and training FLOPs by 22%. The core innovation is restricting local updates to a sparse parameter subset between infrequent global synchronizations, jointly improving communication, computation, and memory efficiency without compromising model accuracy.
📝 Abstract
We introduce a memory- and compute-efficient method for low-communication distributed training. Existing methods reduce communication by performing multiple local updates between infrequent global synchronizations. We demonstrate that their efficiency can be significantly improved by restricting backpropagation: instead of updating all the parameters, each node updates only a fixed subset while keeping the remainder frozen during local steps. This constraint substantially reduces peak memory usage and training FLOPs, while a full forward pass over all parameters eliminates the need for cross-node activation exchange. Experiments on a $1.3$B-parameter language model trained across $32$ nodes show that our method matches the perplexity of prior low-communication approaches under identical token and bandwidth budgets while reducing training FLOPs and peak memory.
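The local-update loop with per-node frozen subsets can be sketched in plain Python on a toy quadratic objective. This is a minimal illustration, not the paper's implementation: the function name, the disjoint round-robin subset assignment, the toy loss `||w - target||^2`, and the plain parameter averaging at each synchronization are all assumptions made for the sketch.

```python
def local_sgd_frozen(num_nodes=4, dim=8, local_steps=5, rounds=3, lr=0.1):
    """Toy sketch of low-communication training with frozen parameter subsets.

    Each node holds a full parameter copy. During local steps it updates only
    its own fixed subset; frozen coordinates skip gradient computation
    entirely. Nodes then synchronize by averaging all replicas (assumed here;
    the paper's aggregation rule may differ). The loss is a toy quadratic
    ||w - target||^2 standing in for the real language-model objective.
    """
    target = [1.0] * dim
    params = [[0.0] * dim for _ in range(num_nodes)]
    # Assumed disjoint assignment: node k owns indices k, k+num_nodes, ...
    subsets = [list(range(k, dim, num_nodes)) for k in range(num_nodes)]

    for _ in range(rounds):
        # Local phase: each node runs several steps touching only its subset.
        for k in range(num_nodes):
            for _ in range(local_steps):
                for i in subsets[k]:  # frozen coordinates: no grad at all
                    grad = 2.0 * (params[k][i] - target[i])
                    params[k][i] -= lr * grad
        # Infrequent global synchronization: average the full replicas.
        avg = [sum(p[i] for p in params) / num_nodes for i in range(dim)]
        params = [avg[:] for _ in range(num_nodes)]
    return params[0]
```

Because gradients are never formed for frozen coordinates, each local step does backward work proportional to the owned subset only, while the forward pass (here, the loss evaluation) still uses the full parameter vector, mirroring the memory and FLOP savings the abstract describes.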