Representation Entanglement for Generation: Training Diffusion Transformers Is Much Easier Than You Think

📅 2025-07-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing diffusion models rely on pre-trained visual representations to alleviate training difficulties, but these external alignments are absent during denoising inference, limiting their ability to fully exploit discriminative semantics. To address this, the authors propose Representation Entanglement for Generation (REG), which entangles a high-level class token from a pretrained foundation model with the image latent tokens of a diffusion Transformer end-to-end, introducing only a single learnable token, so the model generates semantically consistent image-class pairs directly from pure noise. REG embeds a semantic reconstruction mechanism into the denoising process itself, jointly optimizing generative and discriminative representations without external alignment. On ImageNet 256×256, SiT-XL/2 + REG converges 63× faster than the SiT-XL/2 baseline, and the smaller SiT-L/2 + REG trained for only 400K steps surpasses SiT-XL/2 + REPA trained for 4M steps, while the extra token adds less than 0.5% to both FLOPs and inference latency.

📝 Abstract
REPA and its variants effectively mitigate training challenges in diffusion models by incorporating external visual representations from pretrained models, through alignment between the noisy hidden projections of denoising networks and foundational clean image representations. We argue that the external alignment, which is absent during the entire denoising inference process, falls short of fully harnessing the potential of discriminative representations. In this work, we propose a straightforward method called Representation Entanglement for Generation (REG), which entangles low-level image latents with a single high-level class token from pretrained foundation models for denoising. REG acquires the capability to produce coherent image-class pairs directly from pure noise, substantially improving both generation quality and training efficiency. This is accomplished with negligible additional inference overhead, requiring only one single additional token for denoising (<0.5% increase in FLOPs and latency). The inference process concurrently reconstructs both image latents and their corresponding global semantics, where the acquired semantic knowledge actively guides and enhances the image generation process. On ImageNet 256×256, SiT-XL/2 + REG demonstrates remarkable convergence acceleration, achieving **63×** and **23×** faster training than SiT-XL/2 and SiT-XL/2 + REPA, respectively. More impressively, SiT-L/2 + REG trained for merely 400K iterations outperforms SiT-XL/2 + REPA trained for 4M iterations (**10×** longer). Code is available at: https://github.com/Martinser/REG.
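The claimed <0.5% overhead follows from simple token arithmetic. A minimal sketch, assuming a SiT-XL/2-style setup (256×256 image, 32×32 VAE latent, patch size 2, hidden width 1152 — all standard DiT/SiT conventions, not taken from the paper's code): appending a single learnable token to the latent sequence grows it by 1/256 ≈ 0.4%. The function and variable names below are illustrative, not the authors' API.

```python
import numpy as np

def add_reg_token(latent_tokens: np.ndarray, reg_token: np.ndarray) -> np.ndarray:
    """Append one learnable class token to a (num_tokens, dim) latent sequence.

    Illustrative sketch of REG's single-token entanglement; in the real model
    reg_token would be a learnable parameter trained to match a pretrained
    foundation model's class token.
    """
    return np.concatenate([latent_tokens, reg_token[None, :]], axis=0)

# 256x256 image -> 32x32 latent grid, patch size 2 -> 16x16 = 256 tokens
num_tokens, dim = 256, 1152
latents = np.random.randn(num_tokens, dim)
reg_token = np.zeros(dim)  # stands in for the learnable token

seq = add_reg_token(latents, reg_token)
print(seq.shape)  # (257, 1152)

# Sequence length grows by 1/256, ~0.4% -- consistent with the reported
# <0.5% increase in FLOPs and latency.
overhead = (seq.shape[0] - num_tokens) / num_tokens
print(f"{overhead:.2%}")  # 0.39%
```

Since Transformer FLOPs scale roughly linearly in sequence length at fixed width and depth (quadratic only in the attention score term), a one-token increase over 256 tokens stays well under the 0.5% bound quoted above.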
Problem

Research questions and friction points this paper is trying to address.

Improving diffusion model training efficiency
Enhancing image generation quality
Reducing inference overhead with minimal tokens
Innovation

Methods, ideas, or system contributions that make the work stand out.

Entangles image latents with a pretrained class token via one learnable token
Improves generation quality and training efficiency
Adds only one extra denoising token (<0.5% FLOPs and latency)