🤖 AI Summary
To address the high communication overhead and the substantial memory and compute costs of large-scale distributed training, this paper proposes a local parameter-freezing mechanism: during multi-step local updates, each node optimizes only a fixed subset of parameters and disables gradient computation for the frozen remainder, eliminating their gradient computation and transmission, while a full-parameter forward pass avoids inter-node activation exchange. The method combines phased parameter updates, local gradient computation, and full-parameter forward propagation within a standard synchronous training framework. When training a 1.3-billion-parameter language model across 32 nodes, it matches baseline perplexity under identical communication bandwidth and data budgets while reducing peak memory consumption by 27% and training FLOPs by 22%. The core innovation is restricting local updates to a sparse parameter subset between infrequent global synchronizations, jointly improving communication, computation, and memory efficiency without compromising model accuracy.
📝 Abstract
We introduce a memory- and compute-efficient method for low-communication distributed training. Existing methods reduce communication by performing multiple local updates between infrequent global synchronizations. We demonstrate that their efficiency can be significantly improved by restricting backpropagation: instead of updating all the parameters, each node updates only a fixed subset while keeping the remainder frozen during local steps. This constraint substantially reduces peak memory usage and training FLOPs, while a full forward pass over all parameters eliminates the need for cross-node activation exchange. Experiments on a $1.3$B-parameter language model trained across $32$ nodes show that our method matches the perplexity of prior low-communication approaches under identical token and bandwidth budgets while reducing training FLOPs and peak memory.
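The local-update loop with per-node frozen subsets can be sketched in plain Python on a toy quadratic objective. This is a minimal illustration, not the paper's implementation: the function name, the disjoint round-robin subset assignment, the toy loss `||w - target||^2`, and the plain parameter averaging at each synchronization are all assumptions made for the sketch.

```python
def local_sgd_frozen(num_nodes=4, dim=8, local_steps=5, rounds=3, lr=0.1):
    """Toy sketch of low-communication training with frozen parameter subsets.

    Each node holds a full parameter copy. During local steps it updates only
    its own fixed subset; frozen coordinates skip gradient computation
    entirely. Nodes then synchronize by averaging all replicas (assumed here;
    the paper's aggregation rule may differ). The loss is a toy quadratic
    ||w - target||^2 standing in for the real language-model objective.
    """
    target = [1.0] * dim
    params = [[0.0] * dim for _ in range(num_nodes)]
    # Assumed disjoint assignment: node k owns indices k, k+num_nodes, ...
    subsets = [list(range(k, dim, num_nodes)) for k in range(num_nodes)]

    for _ in range(rounds):
        # Local phase: each node runs several steps touching only its subset.
        for k in range(num_nodes):
            for _ in range(local_steps):
                for i in subsets[k]:  # frozen coordinates: no grad at all
                    grad = 2.0 * (params[k][i] - target[i])
                    params[k][i] -= lr * grad
        # Infrequent global synchronization: average the full replicas.
        avg = [sum(p[i] for p in params) / num_nodes for i in range(dim)]
        params = [avg[:] for _ in range(num_nodes)]
    return params[0]
```

Because gradients are never formed for frozen coordinates, each local step does backward work proportional to the owned subset only, while the forward pass (here, the loss evaluation) still uses the full parameter vector, mirroring the memory and FLOP savings the abstract describes.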