Beyond External Guidance: Unleashing the Semantic Richness Inside Diffusion Transformers for Improved Training

📅 2026-01-12
📈 Citations: 1
Influential: 0
🤖 AI Summary
Diffusion Transformers (DiTs) suffer from slow training convergence, and existing acceleration methods rely on external pretrained models, limiting their flexibility and generalization. This work proposes Self-Transcendence, a novel approach that dispenses with external semantic guidance and achieves fully self-supervised training by leveraging only the model's internal features. Specifically, it aligns shallow-layer DiT features with VAE latent representations and enhances the semantic expressiveness of intermediate features through classifier-free guidance (CFG), yielding internal supervision signals that substantially accelerate training. Experimental results demonstrate that, without any external pretrained models, the proposed method outperforms external-guidance approaches such as REPA in both training speed and generation quality.
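The CFG-on-features step mentioned in the summary can be sketched in isolation: extrapolate from an unconditional intermediate feature toward its conditional counterpart, sharpening its class-discriminative content. A minimal NumPy sketch (the function name and guidance scale `w` are illustrative, not taken from the paper):

```python
import numpy as np

def cfg_enhance_features(feat_uncond, feat_cond, w=2.0):
    """Classifier-free guidance applied to intermediate features:
    push the unconditional feature toward (and beyond) the conditional
    one. Hypothetical helper; not the paper's exact implementation."""
    return feat_uncond + w * (feat_cond - feat_uncond)

# With w = 1 this recovers the conditional feature exactly;
# w > 1 extrapolates past it, the usual CFG regime.
fu = np.zeros((2, 4))
fc = np.ones((2, 4))
enhanced = cfg_enhance_features(fu, fc, w=2.0)
```

The same linear-extrapolation rule CFG applies to noise predictions is applied here to features instead, which is what gives the internal supervision signal its extra semantic contrast.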

📝 Abstract
Recent works such as REPA have shown that guiding diffusion models with external semantic features (e.g., from DINO) can significantly accelerate the training of diffusion transformers (DiTs). However, this requires pretrained external networks, introducing additional dependencies and reducing flexibility. In this work, we argue that DiTs can in fact guide their own training, and propose Self-Transcendence, a simple yet effective method that achieves fast convergence using internal feature supervision alone. We find that the slow convergence in DiT training primarily stems from the difficulty of representation learning in the shallow layers. To address this, we first train the DiT for a short phase (e.g., 40 epochs) while aligning its shallow features with the latent representations of the pretrained VAE, then apply classifier-free guidance to the intermediate features, enhancing their discriminative capability and semantic expressiveness. These enriched internal features, learned entirely within the model, serve as supervision signals for training a new DiT. Compared with existing self-contained methods, our approach brings a significant performance boost; it even surpasses REPA in generation quality and convergence speed, without any external pretrained models. Our method is not only more flexible across different backbones but also has the potential to be adopted for a wider range of diffusion-based generative tasks. The source code of our method can be found at https://github.com/csslc/Self-Transcendence.
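The first phase the abstract describes, aligning shallow DiT features with the frozen VAE's latents, amounts to adding an auxiliary alignment loss during early training. A minimal NumPy sketch of one plausible choice, mean (1 − cosine similarity); the loss form and the projection-free setup are assumptions for illustration, not the paper's exact recipe:

```python
import numpy as np

def shallow_alignment_loss(shallow_feats, vae_latents, eps=1e-8):
    """Auxiliary loss aligning shallow DiT features with VAE latents.
    Both inputs are (batch, ...) arrays that are flattened per sample;
    returns mean (1 - cosine similarity) over the batch.
    Hypothetical helper, not the paper's exact implementation."""
    f = shallow_feats.reshape(shallow_feats.shape[0], -1)
    z = vae_latents.reshape(vae_latents.shape[0], -1)
    # Normalize each sample to unit length before taking the dot product.
    f = f / (np.linalg.norm(f, axis=1, keepdims=True) + eps)
    z = z / (np.linalg.norm(z, axis=1, keepdims=True) + eps)
    return 1.0 - float(np.mean(np.sum(f * z, axis=1)))
```

In training, a term like this would be weighted and added to the standard diffusion objective for the short warm-up phase, then dropped once the enriched internal features take over as the supervision signal.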
Problem

Research questions and friction points this paper is trying to address.

Diffusion Transformers
slow convergence
external guidance
representation learning
training acceleration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-Transcendence
Diffusion Transformers
Internal Feature Supervision
Classifier-Free Guidance
Self-Contained Training