Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders

📅 2026-01-22
📈 Citations: 3
Influential: 0
🤖 AI Summary
This work addresses limitations of existing text-to-image (T2I) diffusion models built on variational autoencoders (VAEs), which often suffer from training instability, overfitting, and ceilings on generation quality. The authors instead construct the diffusion process in a high-dimensional semantic latent space using a streamlined representation autoencoder (RAE) that retains only dimension-aware noise scheduling from the original RAE design and enables a unified representation space shared between visual understanding and generation. Using a frozen SigLIP-2 encoder paired with an RAE decoder, the model is trained at scale on web-collected, synthetic, and text-rendered data. Experiments show that RAE consistently outperforms the FLUX VAE across model sizes from 0.5B to 9.8B parameters, fine-tunes stably for 256 epochs without overfitting, and converges faster with higher image fidelity, establishing RAEs as a stronger foundational architecture for large-scale T2I generation.

📝 Abstract
Representation Autoencoders (RAEs) have shown distinct advantages in diffusion modeling on ImageNet by training in high-dimensional semantic latent spaces. In this work, we investigate whether this framework can scale to large-scale, freeform text-to-image (T2I) generation. We first scale RAE decoders on the frozen representation encoder (SigLIP-2) beyond ImageNet by training on web, synthetic, and text-rendering data, finding that while scale improves general fidelity, targeted data composition is essential for specific domains like text. We then rigorously stress-test the RAE design choices originally proposed for ImageNet. Our analysis reveals that scaling simplifies the framework: while dimension-dependent noise scheduling remains critical, architectural complexities such as wide diffusion heads and noise-augmented decoding offer negligible benefits at scale. Building on this simplified framework, we conduct a controlled comparison of RAE against the state-of-the-art FLUX VAE across diffusion transformer scales from 0.5B to 9.8B parameters. RAEs consistently outperform VAEs during pretraining across all model scales. Further, during finetuning on high-quality datasets, VAE-based models catastrophically overfit after 64 epochs, while RAE models remain stable through 256 epochs and achieve consistently better performance. Across all experiments, RAE-based diffusion models demonstrate faster convergence and better generation quality, establishing RAEs as a simpler and stronger foundation than VAEs for large-scale T2I generation. Additionally, because both visual understanding and generation can operate in a shared representation space, the multimodal model can directly reason over generated latents, opening new possibilities for unified models.
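The abstract singles out dimension-dependent noise scheduling as the one design choice that remains critical at scale. A common way to implement this (popularized by SD3-style rectified-flow training and adopted for high-dimensional RAE latents) is to warp sampled timesteps toward higher noise levels as the latent dimensionality grows. The sketch below is an illustrative assumption of that idea, not the paper's exact recipe; the base dimension `d_base` and the square-root scaling rule are hypothetical defaults.

```python
import math

def dimension_shifted_t(t: float, d: int, d_base: int = 4096) -> float:
    """Warp a timestep t in [0, 1] toward higher noise for larger latents.

    Higher-dimensional latents retain more signal at a given noise level,
    so training shifts mass toward noisier timesteps. Here the shift
    factor grows as sqrt(d / d_base), an assumed scaling rule.
    """
    alpha = math.sqrt(d / d_base)
    # Monotone warp: identity when alpha == 1, pushes t upward when alpha > 1.
    return alpha * t / (1 + (alpha - 1) * t)
```

For example, with a latent four times larger than the base (`alpha = 2`), a uniform draw of `t = 0.5` is mapped to roughly `0.67`, so the model sees proportionally more high-noise training steps, while the endpoints `t = 0` and `t = 1` are left fixed.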
Problem

Research questions and friction points this paper is trying to address.

text-to-image generation
diffusion models
Representation Autoencoders
scaling
latent space
Innovation

Methods, ideas, or system contributions that make the work stand out.

Representation Autoencoders
Text-to-Image Generation
Diffusion Transformers
Latent Space Scaling
Multimodal Unified Models