Pretraining A Large Language Model using Distributed GPUs: A Memory-Efficient Decentralized Paradigm

📅 2026-02-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes SPES, a memory-efficient decentralized framework for large language model pretraining that removes the per-node GPU memory constraint of existing decentralized approaches. SPES trains only a subset of experts on each node and introduces sparse expert synchronization, combining periodic knowledge sharing with an expert-merging warm-up strategy to ensure rapid convergence. This breaks the single-node memory barrier, allowing end-to-end training of a 2B-parameter MoE model on just 16 consumer-grade 48GB GPUs while achieving performance comparable to centralized training. The framework further scales to train a 7B model from scratch and to upcycle a 9B model from a dense checkpoint, both matching current centralized baselines.

📝 Abstract
Pretraining large language models (LLMs) typically requires centralized clusters with thousands of high-memory GPUs (e.g., H100/A100). Recent decentralized training methods reduce communication overhead by employing federated optimization; however, they still need to train the entire model on each node, remaining constrained by GPU memory limitations. In this work, we propose SParse Expert Synchronization (SPES), a memory-efficient decentralized framework for pretraining mixture-of-experts (MoE) LLMs. SPES trains only a subset of experts per node, substantially lowering the memory footprint. Each node updates its local experts and periodically synchronizes with other nodes, eliminating full-parameter transmission while ensuring efficient knowledge sharing. To accelerate convergence, we introduce an expert-merging warm-up strategy, where experts exchange knowledge early in training, to rapidly establish foundational capabilities. With SPES, we train a 2B-parameter MoE LLM using 16 standalone 48GB GPUs over internet connections, which achieves competitive performance with centrally trained LLMs under similar computational budgets. We further demonstrate scalability by training a 7B model from scratch and a 9B model upcycled from a dense checkpoint, both of which match prior centralized baselines. Our code is available at https://github.com/zjr2000/SPES.
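The core ideas in the abstract can be illustrated with a minimal sketch: each node holds only its assigned subset of experts, an expert-merging warm-up gives all experts a common starting point, and periodic synchronization averages only the experts replicated across nodes, so no node ever transmits the full model. All names here (`Node`, `warmup_merge`, `sparse_sync`) and the toy update rule are illustrative assumptions, not the authors' actual implementation; see the linked repository for the real code.

```python
# Hypothetical sketch of SPES-style sparse expert synchronization.
# Experts are stand-in weight tensors; the "training" step is random noise.
import numpy as np

class Node:
    """Holds only its assigned subset of experts (the memory saving)."""
    def __init__(self, expert_ids, dim, seed):
        self.rng = np.random.default_rng(seed)
        self.experts = {i: self.rng.normal(size=(dim, dim)) for i in expert_ids}

    def local_step(self, lr=0.01):
        # Stand-in for a gradient update on this node's local data.
        for w in self.experts.values():
            w -= lr * self.rng.normal(size=w.shape)

def warmup_merge(nodes):
    """Expert-merging warm-up: average all experts into one shared tensor so
    every expert starts from common foundational weights before specializing."""
    merged = np.mean([w for n in nodes for w in n.experts.values()], axis=0)
    for n in nodes:
        for i in n.experts:
            n.experts[i] = merged.copy()

def sparse_sync(nodes):
    """Periodic synchronization: only experts replicated on more than one node
    are averaged, so full-parameter transmission is never needed."""
    owners = {}
    for n in nodes:
        for i in n.experts:
            owners.setdefault(i, []).append(n)
    for i, holders in owners.items():
        if len(holders) > 1:
            avg = np.mean([n.experts[i] for n in holders], axis=0)
            for n in holders:
                n.experts[i] = avg.copy()

# 3 nodes, 4 experts, overlapping assignment: each node stores only 2 experts.
nodes = [Node([0, 1], 8, 0), Node([1, 2], 8, 1), Node([2, 3], 8, 2)]
warmup_merge(nodes)
for step in range(10):
    for n in nodes:
        n.local_step()
    if step % 5 == 4:          # synchronize every 5 local steps
        sparse_sync(nodes)

# Replicated expert 1 is identical on nodes 0 and 1 after synchronization.
assert np.allclose(nodes[0].experts[1], nodes[1].experts[1])
```

The key design point mirrored here is that synchronization cost scales with the number of *shared* experts, not the total model size, which is what lets training run over ordinary internet connections.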
Problem

Research questions and friction points this paper is trying to address.

large language models
decentralized training
GPU memory constraints
mixture-of-experts
pretraining
Innovation

Methods, ideas, or system contributions that make the work stand out.

decentralized training
mixture-of-experts
memory-efficient
expert synchronization
large language model pretraining