Heterogeneous Low-Bandwidth Pre-Training of LLMs

📅 2026-01-05
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of scaling large language model pre-training in bandwidth-constrained environments, where frequent communication limits distributed training efficiency. The authors propose a heterogeneous distributed training framework that integrates SparseLoCo, a low-communication data-parallel strategy, with low-bandwidth pipeline model parallelism based on subspace projection. A selective compression scheme compresses both activations and activation gradients, enabling resource-constrained devices to participate collaboratively in training. Experiments on models ranging from 178M to 1B parameters demonstrate the method's effectiveness: compared with uniform global compression, selective compression significantly reduces training loss at high compression ratios while maintaining both communication efficiency and model quality.
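The low-communication data-parallel side of the framework follows the SparseLoCo pattern: workers take several local optimizer steps, then exchange only a sparsified pseudo-gradient (the parameter delta since the last synchronization). The sketch below is a minimal illustration of that idea using plain SGD, top-k magnitude sparsification, and a simulated all-reduce; the function names and the outer learning rate of 1 are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def topk_sparsify(vec, k):
    """Keep only the k largest-magnitude entries of vec; zero the rest."""
    idx = np.argsort(np.abs(vec))[-k:]
    out = np.zeros_like(vec)
    out[idx] = vec[idx]
    return out

def sparseloco_round(global_params, worker_grads_per_step, inner_steps, lr, k):
    """One outer round: each worker runs `inner_steps` local SGD steps from the
    shared parameters, then contributes a top-k sparsified pseudo-gradient
    (initial params minus local params). The average is applied as the outer
    update (outer learning rate fixed at 1 for simplicity)."""
    pseudo_grads = []
    for grads in worker_grads_per_step:  # one list of per-step gradients per worker
        local = global_params.copy()
        for g in grads[:inner_steps]:
            local -= lr * g              # local SGD step, no communication
        pseudo_grads.append(topk_sparsify(global_params - local, k))
    # Simulated all-reduce: average the sparse pseudo-gradients across workers.
    avg = np.mean(pseudo_grads, axis=0)
    return global_params - avg
```

With `k` much smaller than the parameter count, each worker transmits only `k` values (plus indices) per outer round instead of a dense gradient every step, which is the source of the bandwidth savings.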

📝 Abstract
Pre-training large language models (LLMs) increasingly requires distributed compute, yet bandwidth constraints make it difficult to scale beyond well-provisioned datacenters, especially when model parallelism forces frequent, large inter-device communications. We study whether SparseLoCo, a low-communication data parallel method based on infrequent synchronization and sparse pseudo-gradient exchange, can be combined with low-bandwidth pipeline model parallelism via activation and activation-gradient compression. We introduce a heterogeneous distributed training framework where some participants host full replicas on high-bandwidth interconnects, while resource-limited participants are grouped to jointly instantiate a replica using pipeline parallelism with subspace-projected inter-stage communication. To make the recently introduced subspace pipeline compression compatible with SparseLoCo, we study a number of adaptations. Across large-scale language modeling experiments (178M–1B parameters) on standard pretraining corpora, we find that activation compression composes with SparseLoCo at modest cost, while selective (heterogeneous) compression consistently improves the loss-communication tradeoff relative to compressing all replicas, especially at aggressive compression ratios. These results suggest a practical path to incorporating low-bandwidth model parallelism and heterogeneous participants into LLM pre-training.
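The abstract's "subspace-projected inter-stage communication" means each pipeline stage sends activation coefficients in a low-dimensional subspace rather than full activations, cutting inter-stage traffic by roughly the ratio of subspace to model dimension. A minimal sketch follows, using a random orthonormal basis as a stand-in; the paper's basis may be learned or chosen differently, and the function names here are illustrative assumptions.

```python
import numpy as np

def make_projection(d_model, d_sub, seed=0):
    """Orthonormal basis of a d_sub-dimensional subspace of R^d_model,
    built here from the QR factorization of a random Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((d_model, d_sub)))
    return q  # shape (d_model, d_sub), orthonormal columns

def compress(acts, basis):
    """Sender side: project activations onto the subspace; only the
    (batch, d_sub) coefficient matrix crosses the slow link."""
    return acts @ basis

def decompress(coeffs, basis):
    """Receiver side: lift coefficients back to the full activation space."""
    return coeffs @ basis.T
```

The same projection scheme applies to activation gradients flowing backward between stages; because the basis has orthonormal columns, any signal already lying in the subspace is transmitted losslessly, and only the component orthogonal to the subspace is discarded.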
Problem

Research questions and friction points this paper is trying to address.

large language models
low-bandwidth
heterogeneous training
model parallelism
distributed pre-training
Innovation

Methods, ideas, or system contributions that make the work stand out.

SparseLoCo
pipeline parallelism
activation compression
heterogeneous training
low-bandwidth pre-training