Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

📅 2024-10-09
🏛️ arXiv.org
📈 Citations: 17
Influential: 2
🤖 AI Summary
Diffusion models suffer from inefficient representation learning and limited generation quality due to semantically impoverished latent spaces. To address this, the authors propose REPA (REPresentation Alignment), a regularization method that explicitly aligns the noisy hidden states inside diffusion transformers (DiT/SiT) with clean-image representations extracted from high-quality external vision encoders (e.g., CLIP or DINOv2). The alignment is enforced via a projection-based loss, optimized end-to-end alongside the denoising objective to enhance semantic consistency in the latent space. Experiments show that REPA accelerates SiT training by over 17.5×, allowing a REPA-regularized SiT-XL to match (without classifier-free guidance) the performance of a vanilla SiT-XL trained for 7M steps in fewer than 400K steps. With classifier-free guidance and the guidance interval, the method achieves an FID of 1.42, a state-of-the-art result at the time. REPA establishes a principled paradigm for improving representation learning in diffusion models through explicit cross-architecture semantic alignment.
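The projection-based alignment loss described above can be sketched as follows. This is a minimal illustration, not the authors' exact implementation: the `REPAProjection` MLP, layer shapes, and `repa_loss` name are assumptions; the pretrained encoder's patch features are stood in for by a random tensor.

```python
# Hypothetical sketch of the REPA regularizer: project the diffusion
# transformer's noisy hidden states into the feature space of a frozen
# pretrained encoder (e.g. DINOv2) and maximize patch-wise cosine
# similarity with the clean-image features. Shapes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class REPAProjection(nn.Module):
    """Small MLP mapping transformer hidden states to the encoder's dim."""
    def __init__(self, hidden_dim: int, enc_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, enc_dim),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.proj(h)

def repa_loss(hidden: torch.Tensor, enc_feats: torch.Tensor,
              projector: nn.Module) -> torch.Tensor:
    """Negative mean patch-wise cosine similarity between projected noisy
    hidden states and (detached) clean-image encoder features."""
    z = F.normalize(projector(hidden), dim=-1)   # (B, N, enc_dim)
    y = F.normalize(enc_feats.detach(), dim=-1)  # encoder is frozen
    return -(z * y).sum(dim=-1).mean()

# Toy shapes: batch 2, 16 patch tokens, hidden dim 64, encoder dim 32.
projector = REPAProjection(64, 32)
hidden = torch.randn(2, 16, 64)       # hidden states from a DiT/SiT block
enc_feats = torch.randn(2, 16, 32)    # stand-in for pretrained features
loss = repa_loss(hidden, enc_feats, projector)
```

Because the loss is a negative cosine similarity averaged over patches, it lies in [-1, 1] and is minimized when every projected noisy token points in the same direction as its clean-image counterpart.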

📝 Abstract
Recent studies have shown that the denoising process in (generative) diffusion models can induce meaningful (discriminative) representations inside the model, though the quality of these representations still lags behind those learned through recent self-supervised learning methods. We argue that one main bottleneck in training large-scale diffusion models for generation lies in effectively learning these representations. Moreover, training can be made easier by incorporating high-quality external visual representations, rather than relying solely on the diffusion models to learn them independently. We study this by introducing a straightforward regularization called REPresentation Alignment (REPA), which aligns the projections of noisy input hidden states in denoising networks with clean image representations obtained from external, pretrained visual encoders. The results are striking: our simple strategy yields significant improvements in both training efficiency and generation quality when applied to popular diffusion and flow-based transformers, such as DiTs and SiTs. For instance, our method can speed up SiT training by over $17.5\times$, matching the performance (without classifier-free guidance) of a SiT-XL model trained for 7M steps in less than 400K steps. In terms of final generation quality, our approach achieves state-of-the-art results of FID=1.42 using classifier-free guidance with the guidance interval.
Problem

Research questions and friction points this paper is trying to address.

Improving representation quality in diffusion models for generation.
Enhancing training efficiency using external visual representations.
Achieving state-of-the-art generation quality with fewer training steps.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces REPA for aligning noisy and clean representations.
Uses external pretrained visual encoders for better training.
Significantly improves training efficiency and generation quality.
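The contributions above come together in a single training objective: the standard denoising loss plus the weighted alignment regularizer. A minimal sketch, assuming an MSE-style denoising target and a hypothetical weighting coefficient `lambda_repa` (the name and default value are illustrative, not taken from the paper):

```python
# Hypothetical combined objective: denoising loss + lambda * REPA term.
# `proj_hidden` stands for the already-projected noisy hidden states and
# `enc_feats` for frozen clean-image encoder features; both names are
# assumptions made for this sketch.
import torch
import torch.nn.functional as F

def total_loss(pred_noise: torch.Tensor, true_noise: torch.Tensor,
               proj_hidden: torch.Tensor, enc_feats: torch.Tensor,
               lambda_repa: float = 0.5) -> torch.Tensor:
    # Standard denoising regression target.
    denoise = F.mse_loss(pred_noise, true_noise)
    # Negative patch-wise cosine similarity (the REPA regularizer).
    z = F.normalize(proj_hidden, dim=-1)
    y = F.normalize(enc_feats.detach(), dim=-1)
    align = -(z * y).sum(dim=-1).mean()
    return denoise + lambda_repa * align

# Toy usage with random tensors.
loss = total_loss(
    pred_noise=torch.randn(2, 3, 8, 8),
    true_noise=torch.randn(2, 3, 8, 8),
    proj_hidden=torch.randn(2, 16, 32),
    enc_feats=torch.randn(2, 16, 32),
)
```

Setting `lambda_repa=0` recovers the plain denoising objective, which makes the regularizer easy to ablate.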