🤖 AI Summary
Large-scale diffusion model training relies heavily on centralized, high-bandwidth networks and struggles to leverage distributed, heterogeneous compute. Method: This paper proposes the first decentralized training framework for diffusion models: the dataset is partitioned across isolated clusters; expert models are trained independently on each partition; and a lightweight router combines their outputs at inference—provably equivalent to optimizing a single model over the full dataset. The framework enables diffusion model training without central coordination, facilitating collaboration across geographically separated "computational islands" (e.g., distinct data centers). Contributions/Results: It introduces expert model parallelism, routing-based ensemble integration, a decentralized convergence analysis, and FLOP-aligned evaluation. Empirically, it achieves higher FLOP efficiency than standard diffusion models on ImageNet and LAION Aesthetics. A 24B-parameter model trains in under a week on just eight GPU nodes.
📝 Abstract
Large-scale AI model training divides work across thousands of GPUs, then synchronizes gradients across them at each step. This incurs a significant network burden that only centralized, monolithic clusters can support, driving up infrastructure costs and straining power systems. We propose Decentralized Diffusion Models, a scalable framework for distributing diffusion model training across independent clusters or datacenters by eliminating the dependence on a centralized, high-bandwidth networking fabric. Our method trains a set of expert diffusion models over partitions of the dataset, each in full isolation from the others. At inference time, the experts ensemble through a lightweight router. We show that the ensemble collectively optimizes the same objective as a single model trained over the whole dataset. This means we can divide the training burden among a number of "compute islands," lowering infrastructure costs and improving resilience to localized GPU failures. Decentralized diffusion models empower researchers to take advantage of smaller, more cost-effective, and more readily available compute, such as on-demand GPU nodes, rather than central integrated systems. We conduct extensive experiments on ImageNet and LAION Aesthetics, showing that decentralized diffusion models outperform standard diffusion models FLOP-for-FLOP. We finally scale our approach to 24 billion parameters, demonstrating that high-quality diffusion models can now be trained with just eight individual GPU nodes in less than a week.
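The inference-time ensembling described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: it assumes each expert emits a score (or noise) prediction for the current sample, and that the router produces per-expert logits which are softmax-normalized into mixture weights (roughly, the posterior probability that the sample belongs to each expert's data partition). The function names and shapes are our own choices for exposition.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_score(expert_scores, router_logits):
    """Combine per-expert predictions with router weights.

    expert_scores: (K, D) array -- each of K experts' score prediction
                   for the current noisy sample (flattened to D dims).
    router_logits: (K,) array -- the router's unnormalized assignment of
                   the sample to each of the K data partitions.
    Returns the weighted mixture prediction, shape (D,).
    """
    w = softmax(router_logits)                  # mixture weights, sum to 1
    return (w[:, None] * expert_scores).sum(axis=0)

# Example: with a near-one-hot router, the ensemble follows one expert;
# with uniform logits, it averages all experts.
rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 8))
confident = ensemble_score(scores, np.array([10.0, -10.0, -10.0, -10.0]))
uniform = ensemble_score(scores, np.zeros(4))
```

At each denoising step, the router is evaluated once on the noisy sample and its weights select (or blend) the expert outputs; this is what lets fully isolated experts act collectively as one model at inference.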