Scaling Diffusion Transformers Efficiently via $\mu$P

📅 2025-05-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion Transformers (DiTs) face prohibitive hyperparameter tuning costs when scaling up. This work systematically extends the Maximal Update Parameterization (μP) framework to diffusion models for the first time, rigorously proving that mainstream architectures—including DiT, U-ViT, PixArt-α, and MMDiT—share the same μP scaling law as standard Transformers, thereby establishing the first theoretically grounded, efficient scaling framework for diffusion models. We propose a joint learning-rate and initialization scaling strategy tailored to diffusion objectives. Empirically, this accelerates convergence by 2.9× on DiT-XL-2; enables scaling PixArt-α to 0.61B parameters and MMDiT to 18B parameters—both surpassing baseline performance—while reducing hyperparameter tuning cost to just 5.5% of a single training run for PixArt-α and to only 3% of expert manual tuning for MMDiT.

📝 Abstract
Diffusion Transformers have emerged as the foundation for vision generative models, but their scalability is limited by the high cost of hyperparameter (HP) tuning at large scales. Recently, Maximal Update Parametrization ($\mu$P) was proposed for vanilla Transformers; it enables stable HP transfer from small to large language models and dramatically reduces tuning costs. However, it remains unclear whether the $\mu$P of vanilla Transformers extends to diffusion Transformers, which differ in both architecture and training objective. In this work, we generalize standard $\mu$P to diffusion Transformers and validate its effectiveness through large-scale experiments. First, we rigorously prove that the $\mu$P of mainstream diffusion Transformers, including DiT, U-ViT, PixArt-$\alpha$, and MMDiT, aligns with that of the vanilla Transformer, enabling the direct application of existing $\mu$P methodologies. Leveraging this result, we systematically demonstrate that DiT-$\mu$P enjoys robust HP transferability. Notably, DiT-XL-2-$\mu$P with a transferred learning rate achieves 2.9 times faster convergence than the original DiT-XL-2. Finally, we validate the effectiveness of $\mu$P on text-to-image generation by scaling PixArt-$\alpha$ from 0.04B to 0.61B parameters and MMDiT from 0.18B to 18B parameters. In both cases, models under $\mu$P outperform their respective baselines while requiring only a small tuning cost: 5.5% of one training run for PixArt-$\alpha$ and 3% of the cost of manual tuning by human experts for MMDiT-18B. These results establish $\mu$P as a principled and efficient framework for scaling diffusion Transformers.
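The HP transfer the abstract describes rests on the standard $\mu$P scaling rules for width: under Adam, learning rates for hidden weight matrices shrink roughly as 1/width, and initialization standard deviation shrinks as 1/sqrt(fan-in). A minimal sketch of that rule, assuming the standard $\mu$P prescription for Adam (the function name and exact rule choices are illustrative assumptions, not the paper's code):

```python
# Illustrative sketch of muP-style width scaling for hidden layers
# (standard muP prescription for Adam; hypothetical helper, not the
# authors' implementation).

def mup_scaled_hparams(base_lr: float, base_width: int, width: int):
    """Transfer an LR tuned at base_width to a model of the given width.

    Returns (hidden_lr, init_std):
      hidden_lr -- Adam LR for hidden matrices, scaled ~ 1/width
      init_std  -- weight init std, scaled ~ 1/sqrt(fan_in)
    """
    ratio = width / base_width
    hidden_lr = base_lr / ratio          # LR shrinks linearly with width
    init_std = (1.0 / width) ** 0.5      # init std shrinks as 1/sqrt(width)
    return hidden_lr, init_std

# Example: an LR tuned on a width-256 proxy model, transferred to width 1024.
lr, std = mup_scaled_hparams(3e-4, base_width=256, width=1024)
```

In practice this is why the paper's tuning cost is so low: the HP sweep runs only on the small proxy model, and the scaled values are applied to the large model without further search.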
Problem

Research questions and friction points this paper is trying to address.

Extending μP to diffusion Transformers for efficient scaling
Validating μP's effectiveness in large-scale diffusion Transformer experiments
Reducing hyperparameter tuning costs in vision generative models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generalizes μP to diffusion Transformers effectively
Enables robust hyperparameter transferability in DiT-μP
Reduces tuning costs significantly for large-scale models