🤖 AI Summary
This work addresses the challenge of efficiently training high-quality text-to-image diffusion models without centralized coordination infrastructure. We propose the first fully decentralized pretraining framework for diffusion models: the model is partitioned into eight independent expert subnetworks, each trained autonomously with no parameter or gradient synchronization, while a lightweight semantic-aware transformer router dynamically assigns data to experts via clustering, enabling distributed training across heterogeneous hardware. Our method matches the generation quality of centrally coordinated baselines while using 14× less training data and 16× less compute than the prior decentralized baseline. To our knowledge, this is the first open-source, commercially usable, fully decentralized pretraining framework for text-to-image diffusion models. It empirically validates the feasibility and efficiency of decentralized training for generative AI, establishing a new paradigm for privacy-preserving learning, computational democratization, and federated AI research.
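The clustering-based data assignment above can be sketched as follows. This is an illustrative stand-in, not the paper's implementation: it assumes each training sample has a caption embedding and uses a toy k-means to split the corpus into one semantically coherent shard per expert. All names (`kmeans_partition`, `shards`) are hypothetical.

```python
# Hypothetical sketch: partition caption embeddings into one cluster per
# expert so each expert trains only on its own semantic subset.
import numpy as np

def kmeans_partition(embeddings: np.ndarray, num_experts: int = 8,
                     iters: int = 20, seed: int = 0) -> np.ndarray:
    """Assign each sample to one of `num_experts` clusters (toy k-means)."""
    rng = np.random.default_rng(seed)
    # initialize centroids from randomly chosen samples
    centroids = embeddings[rng.choice(len(embeddings), num_experts, replace=False)]
    for _ in range(iters):
        # distance of every sample to every centroid -> (n_samples, num_experts)
        d = np.linalg.norm(embeddings[:, None, :] - centroids[None, :, :], axis=-1)
        assign = d.argmin(axis=1)
        for k in range(num_experts):
            members = embeddings[assign == k]
            if len(members):
                centroids[k] = members.mean(axis=0)
    return assign

# Toy usage: 800 fake caption embeddings split across 8 experts; each
# expert would then train in isolation on its shard's sample indices.
emb = np.random.default_rng(1).normal(size=(800, 32))
assignments = kmeans_partition(emb, num_experts=8)
shards = {k: np.flatnonzero(assignments == k) for k in range(8)}
```

Because the shards are disjoint, each expert's training run touches only its own data and never exchanges gradients or parameters with the others.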
📝 Abstract
We present Paris, the first publicly released diffusion model pre-trained entirely through decentralized computation. Paris demonstrates that high-quality text-to-image generation can be achieved without centrally coordinated infrastructure, and it is open for both research and commercial use. Training Paris required implementing our Distributed Diffusion Training framework from scratch. The model consists of 8 expert diffusion models (129M-605M parameters each) trained in complete isolation, with no gradient, parameter, or intermediate-activation synchronization. Rather than requiring synchronized gradient updates across thousands of GPUs, we partition the data into semantically coherent clusters; each expert independently optimizes its own subset while the experts collectively approximate the full distribution. A lightweight transformer router dynamically selects the appropriate expert at inference, achieving generation quality comparable to centrally coordinated baselines. Eliminating synchronization enables training on heterogeneous hardware without specialized interconnects. Empirical validation confirms that Paris's decentralized training maintains generation quality while removing the dedicated GPU-cluster requirement for large-scale diffusion models. Paris achieves this using 14$\times$ less training data and 16$\times$ less compute than the prior decentralized baseline.
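The inference-time routing step can be sketched as below. The abstract's router is a small transformer; as a simplifying assumption, this sketch reduces it to scoring a prompt embedding against one learned prototype vector per expert and dispatching the prompt to the top-scoring expert. The names (`route`, `prototypes`) and the prototype-scoring scheme are illustrative, not the paper's architecture.

```python
# Minimal sketch of routing a prompt to one of 8 isolated experts.
import numpy as np

def route(prompt_emb: np.ndarray, expert_prototypes: np.ndarray):
    """Return (chosen expert index, softmax routing probabilities)."""
    logits = expert_prototypes @ prompt_emb          # one score per expert
    probs = np.exp(logits - logits.max())            # numerically stable softmax
    probs /= probs.sum()
    return int(probs.argmax()), probs

rng = np.random.default_rng(0)
prototypes = rng.normal(size=(8, 32))   # one learned prototype per expert
prompt = rng.normal(size=32)            # embedded text prompt
idx, probs = route(prompt, prototypes)
# the selected expert alone runs the full denoising loop for this prompt,
# so no cross-expert communication is needed at inference either
```

Because only the selected expert runs the denoising loop, inference cost stays close to that of a single expert rather than the full 8-expert ensemble.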