Distributed Low-Communication Training with Decoupled Momentum Optimization

📅 2025-10-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high-bandwidth interconnect dependency and substantial communication overhead in distributed training of large-scale models, this paper proposes a low-communication training method based on momentum frequency-domain sparsification. The core innovation lies in modeling optimizer momentum as a time-series signal, applying the discrete cosine transform (DCT) to separate its high- and low-frequency components, and synchronizing only the information-rich high-frequency part—thereby decoupling momentum updates from gradient synchronization. Combined with periodic model replica synchronization and Nesterov momentum compression, the method significantly reduces communication load. Extensive evaluations across Transformer and CNN architectures demonstrate up to 16× reduction in communication volume compared to the DiLoCo baseline, while preserving convergence stability and model accuracy—even under low-bandwidth conditions.

📝 Abstract
The training of large models demands substantial computational resources, typically available only in data centers with high-bandwidth interconnects. However, reducing the reliance on high-bandwidth interconnects between nodes enables the use of distributed compute resources as an alternative to centralized data center training. Building on recent advances in distributed model training, we propose an approach that further reduces communication by combining infrequent synchronizations across distributed model replicas with gradient momentum compression. In particular, we treat the optimizer momentum as a signal and decompose the Nesterov momentum into high- and low-frequency components via the discrete cosine transform (DCT). Only the high-frequency components are synchronized across model replicas every $H$ steps. Empirically, our method achieves up to a $16\times$ reduction in communication compared to the baseline DiLoCo, and it generalizes across architectures, including transformer-based language models and convolutional neural networks for images. Overall, this work advances the feasibility of training large models on distributed nodes with low-bandwidth interconnects.
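The frequency-domain split described in the abstract can be sketched in a few lines. The snippet below is an illustrative toy, not the paper's algorithm: it treats a 1-D momentum vector as a signal, applies an orthonormal DCT-II (implemented directly with NumPy so no extra dependencies are needed), zeroes coefficients above an assumed `cutoff` index to form the low-frequency band, and takes the remainder as the high-frequency band that would be synchronized across replicas. The function names and the `cutoff` parameter are ours, not the paper's.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix; because it is orthonormal, its
    inverse is simply its transpose."""
    k = np.arange(n)[:, None]          # frequency index (rows)
    x = np.arange(n)[None, :]          # time/step index (columns)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * x + 1) * k / (2 * n))
    c[0, :] /= np.sqrt(2.0)            # DC row scaling for orthonormality
    return c

def split_momentum(momentum, cutoff):
    """Split a 1-D momentum signal into low- and high-frequency parts.

    `cutoff` is a hypothetical frequency index: bins below it form the
    low-frequency band, the rest form the high-frequency band.
    """
    n = momentum.shape[0]
    C = dct_matrix(n)
    coeffs = C @ momentum              # momentum in the frequency domain
    low = coeffs.copy()
    low[cutoff:] = 0.0                 # keep only low-frequency bins
    high = coeffs - low                # remaining high-frequency bins
    # Inverse DCT (C.T) maps each band back to the step domain.
    return C.T @ low, C.T @ high

# Usage: the two bands sum back exactly to the original momentum.
rng = np.random.default_rng(0)
m = rng.standard_normal(64)
m_low, m_high = split_momentum(m, cutoff=8)
```

In this toy, only `m_high` would be communicated between replicas; because the DCT is orthonormal, `m_low + m_high` reconstructs the original momentum exactly.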
Problem

Research questions and friction points this paper is trying to address.

Reducing communication overhead in distributed model training
Optimizing momentum synchronization across distributed compute nodes
Enabling large model training with low-bandwidth interconnects
Innovation

Methods, ideas, or system contributions that make the work stand out.

Infrequent synchronization reduces communication between nodes
Momentum compression via discrete cosine transform decomposition
High-frequency components synchronized periodically across replicas
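The infrequent-synchronization idea in the bullets above can be sketched as a minimal loop. This is a simplification under stated assumptions: each replica takes `H` purely local steps, then replicas communicate and are averaged. The paper synchronizes compressed high-frequency momentum rather than raw parameters, and `local_gradient` here is a stand-in, not a real model gradient.

```python
import numpy as np

H = 8        # synchronization interval: steps between communications
STEPS = 32   # total local steps per replica
LR = 0.1     # learning rate

rng = np.random.default_rng(1)
# Four model replicas, each a small parameter vector (toy stand-in).
replicas = [rng.standard_normal(16) for _ in range(4)]

def local_gradient(params, rng):
    # Stand-in gradient: pull parameters toward zero, plus noise.
    return params + 0.1 * rng.standard_normal(params.shape)

for step in range(1, STEPS + 1):
    # Local updates: no communication between replicas here.
    for i, p in enumerate(replicas):
        replicas[i] = p - LR * local_gradient(p, rng)
    if step % H == 0:
        # Communication happens only every H steps. Averaging raw
        # parameters is a simplification; the paper instead exchanges
        # DCT-compressed high-frequency momentum.
        mean = np.mean(replicas, axis=0)
        replicas = [mean.copy() for _ in replicas]
```

Communication volume scales with `1/H` relative to per-step synchronization; the momentum compression in the paper then reduces the payload of each of those infrequent exchanges further.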