DiverseDiT: Towards Diverse Representation Learning in Diffusion Transformers

📅 2026-03-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
The representational evolution mechanism in Diffusion Transformers (DiTs) remains poorly understood, particularly due to the lack of effective modeling of cross-layer representational diversity, which limits their generative performance. This work systematically analyzes the internal representational dynamics of DiTs and, for the first time, reveals the critical role of inter-block representational diversity in enabling efficient learning. To address this, we propose DiverseDiT, a novel framework that explicitly encourages layers to learn distinct features through long-range residual connections and a dedicated representational diversity loss. Our approach complements existing representation learning strategies and achieves significant improvements in both generation quality and convergence speed on ImageNet at 256×256 and 512×512 resolutions. Moreover, DiverseDiT demonstrates broad applicability across various backbone architectures and challenging settings such as single-step generation.
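The summary above attributes part of DiverseDiT's effect to long-range residual connections that diversify the inputs seen by different blocks. A minimal numpy sketch of that idea is below; the block structure, the `alpha` blending weight, and the function names are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

def block(x, w):
    """A toy transformer-block stand-in: linear map + tanh nonlinearity."""
    return np.tanh(x @ w)

def long_residual_forward(x0, weights, alpha=0.5):
    """Run blocks sequentially, blending each block's input with a
    long-range residual from the stack input x0, so that deeper blocks
    receive inputs that differ from a purely sequential pipeline.
    `alpha` (illustrative) controls the blend."""
    h = x0
    for w in weights:
        # long-range residual: mix the running activation with x0
        h = block(alpha * h + (1.0 - alpha) * x0, w)
    return h
```

This only sketches the routing pattern; in the actual model the skip sources, blend weights, and normalization would be learned design choices.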

📝 Abstract
Recent breakthroughs in Diffusion Transformers (DiTs) have revolutionized visual synthesis thanks to their superior scalability. To improve DiTs' ability to capture meaningful internal representations, recent works such as REPA incorporate external pretrained encoders for representation alignment. However, the mechanisms governing representation learning within DiTs remain poorly understood. To this end, we first systematically investigate the representation dynamics of DiTs. By analyzing the evolution and influence of internal representations under various settings, we reveal that representation diversity across blocks is a crucial factor for effective learning. Based on this insight, we propose DiverseDiT, a novel framework that explicitly promotes representation diversity. DiverseDiT incorporates long residual connections to diversify input representations across blocks and a representation diversity loss that encourages blocks to learn distinct features. Extensive experiments on ImageNet 256×256 and 512×512 demonstrate that DiverseDiT yields consistent performance gains and faster convergence when applied to backbones of various sizes, even in the challenging one-step generation setting. Furthermore, DiverseDiT is complementary to existing representation learning techniques, yielding further gains when combined with them. Our work provides valuable insights into the representation learning dynamics of DiTs and offers a practical approach for enhancing their performance.
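The abstract describes a representation diversity loss that encourages blocks to learn distinct features. One plausible form of such a loss, sketched here in numpy, penalizes the mean pairwise cosine similarity between per-block features; the function name and the exact formulation are assumptions for illustration, not the paper's definition:

```python
import numpy as np

def representation_diversity_loss(feats, eps=1e-8):
    """Hypothetical diversity penalty.

    feats: array of shape (num_blocks, dim), one pooled feature per block.
    Returns the mean cosine similarity over distinct block pairs;
    minimizing it pushes blocks toward mutually distinct representations.
    """
    # L2-normalize each block's feature vector
    normed = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + eps)
    sim = normed @ normed.T                      # pairwise cosine similarities
    n = feats.shape[0]
    off_diag = sim[~np.eye(n, dtype=bool)]       # drop self-similarities
    return off_diag.mean()
```

Identical features give a loss near 1, while mutually orthogonal features give 0, so gradient descent on this term spreads block representations apart.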
Problem

Research questions and friction points this paper is trying to address.

Diffusion Transformers
representation learning
representation diversity
internal representations
DiT
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion Transformers
Representation Diversity
Long Residual Connections
Diversity Loss
Visual Synthesis