Protocol Models: Scaling Decentralized Training with Communication-Efficient Model Parallelism

📅 2025-06-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the accumulation of compression errors caused by cross-layer propagation of activations and gradients in decentralized model-parallel training, this paper proposes the first forward/backward joint compression framework tailored for model parallelism. The method introduces a predefined low-dimensional subspace based on the recursive structure of Transformers, enabling lossless reconstruction of activations and gradients with zero convergence degradation. It further designs an inter-layer error-correction mechanism and a model-parallelism-aware communication scheduling strategy. Experiments demonstrate up to a 99% communication compression ratio and a 100× improvement in communication efficiency. Notably, the approach trains billion-parameter models on consumer-grade networks with only 80 Mbps of bandwidth, matching the convergence attained in 100 Gbps data-center environments. This work bridges the gap between high-fidelity distributed training and resource-constrained edge or wide-area network settings.

📝 Abstract
Scaling models has led to significant advancements in deep learning, but training these models in decentralized settings remains challenging due to communication bottlenecks. While existing compression techniques are effective in data-parallel training, they do not extend to model parallelism. Unlike data-parallel training, where weight gradients are exchanged, model parallelism requires compressing activations and activation gradients as they propagate through layers, so compression errors accumulate. We propose a novel compression algorithm that compresses both the forward and backward passes, enabling up to 99% compression with no convergence degradation and negligible memory/compute overhead. By leveraging a recursive structure in transformer networks, we predefine a low-dimensional subspace that confines the activations and gradients, allowing full reconstruction in subsequent layers. Our method achieves up to a 100× improvement in communication efficiency and enables training billion-parameter-scale models over low-end GPUs connected via consumer-grade internet at speeds as low as 80 Mbps, matching the convergence of centralized datacenter systems with 100 Gbps connections using model parallelism.
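The core idea in the abstract can be illustrated with a minimal sketch: both pipeline stages share a predefined low-dimensional basis, the sender transmits only the subspace coefficients, and the receiver reconstructs the full activation. This is not the paper's actual algorithm; the basis `U`, the dimensions, and the function names are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, rank, batch = 1024, 16, 8   # hypothetical sizes, not from the paper

# Predefined orthonormal basis U (d_model x rank) known to BOTH stages,
# so only the rank-dimensional coefficients ever cross the network.
U, _ = np.linalg.qr(rng.standard_normal((d_model, rank)))

def compress(activation):
    """Sender: project onto the shared subspace; transmit coefficients only."""
    return activation @ U            # shape (batch, rank)

def reconstruct(coeffs):
    """Receiver: rebuild the full activation from the shared basis."""
    return coeffs @ U.T              # shape (batch, d_model)

# If activations are confined to the subspace (which the paper arranges via
# the transformer's recursive structure), reconstruction is exact.
x = rng.standard_normal((batch, rank)) @ U.T   # an in-subspace activation
x_hat = reconstruct(compress(x))
print(np.allclose(x, x_hat))         # exact recovery for in-subspace inputs
print(1 - rank / d_model)            # fraction of traffic removed: 0.984375
```

The same projection applies to activation gradients on the backward pass; the point of the sketch is only that a shared, predefined basis turns a `d_model`-sized transfer into a `rank`-sized one with no loss for in-subspace tensors.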
Problem

Research questions and friction points this paper is trying to address.

Addressing communication bottlenecks in decentralized model-parallel training
Compressing activations and gradients without convergence degradation
Enabling efficient billion-parameter training on low-bandwidth networks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compresses both forward and backward passes
Leverages recursive structure in transformers
Enables training on low-end GPUs
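The reported figures line up arithmetically: removing 99% of communicated bytes leaves 1/100 of the traffic, which is where the ~100× efficiency gain and the 80 Mbps feasibility claim connect. The numbers below are back-of-the-envelope only, not measurements from the paper, and say nothing about convergence on their own.

```python
raw_mbps = 80                        # consumer-grade link cited by the paper
compression_pct = 99                 # percent of bytes removed
traffic_shrink = 100 // (100 - compression_pct)   # 100x fewer bytes on the wire
effective_mbps = raw_mbps * traffic_shrink        # bandwidth-equivalent traffic
print(traffic_shrink, effective_mbps)             # 100 8000
```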