RiT: Vanilla Diffusion Transformers Suffice in Representation Space

📅 2026-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work investigates whether a pretrained representation space lets a diffusion model reach high generation quality while keeping a simple architecture. Operating on frozen DINOv2 features, the authors train a vanilla Diffusion Transformer (RiT) with the standard x-prediction objective, removing the need for the specialized prediction heads or Riemannian transport used by prior DINOv2 diffusion methods. The key contributions are a geometric comparison showing that DINOv2 features make the flow-matching regression better conditioned than pixel or SD-VAE inputs, and two additions to the vanilla model: a dimension-aware noise schedule and joint [CLS]-patch modeling. On ImageNet 256×256, RiT reaches an FID of 1.45 without guidance and 1.14 with classifier-free guidance, outperforming DiT^DH-XL with 19% fewer parameters (676M vs. 839M). With classifier-free guidance, 5 Heun steps reach an FID of 2.0 and 10 steps reach 1.25, without distillation or consistency training.
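To make the x-prediction objective mentioned in the summary concrete, here is a minimal sketch in plain Python. It assumes the linear interpolation $z_t = t\,x + (1-t)\,\epsilon$ with velocity $v = x - \epsilon$, which is a common flow-matching convention but is not stated in the text above. The function names and the toy vectors are hypothetical and stand in for DINOv2 features and a trained network. The sketch only illustrates how a clean-data prediction relates to a velocity prediction; it is not the authors' training code.

```python
import random


def interpolate(x, eps, t):
    """Noisy point z_t = t * x + (1 - t) * eps (assumed convention)."""
    return [t * xi + (1.0 - t) * ei for xi, ei in zip(x, eps)]


def x_prediction_loss(x_hat, x):
    """Mean squared error between the predicted and true clean data point."""
    return sum((a - b) ** 2 for a, b in zip(x_hat, x)) / len(x)


def velocity_loss_from_x_pred(x_hat, x, eps, t):
    """Velocity-space error implied by an x-prediction.

    The x-prediction is converted to a velocity via
    v_hat = (x_hat - z_t) / (1 - t) and compared with v = x - eps.
    """
    z = interpolate(x, eps, t)
    v_target = [xi - ei for xi, ei in zip(x, eps)]
    v_hat = [(xh - zi) / (1.0 - t) for xh, zi in zip(x_hat, z)]
    return sum((a - b) ** 2 for a, b in zip(v_hat, v_target)) / len(x)


# Toy stand-ins: an 8-dimensional "feature", Gaussian noise, and an
# imperfect prediction of the clean feature.
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(8)]
eps = [rng.gauss(0.0, 1.0) for _ in range(8)]
x_hat = [xi + 0.1 * rng.gauss(0.0, 1.0) for xi in x]
t = 0.7

loss_x = x_prediction_loss(x_hat, x)
loss_v = velocity_loss_from_x_pred(x_hat, x, eps, t)
```

Under the assumed convention, `loss_v` equals `loss_x / (1 - t) ** 2`, so the two objectives differ only by a time-dependent weight. The network output being regressed is the clean data point in one case and the velocity in the other.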

📝 Abstract

Flow matching with $x$-prediction -- regressing the clean data point rather than the ambient velocity -- is known to exploit low-dimensional manifold structure effectively in pixel space \cite{li2025back}. We ask whether a pretrained representation space, while containing a low-dimensional data manifold of comparable intrinsic dimensionality, offers a distribution more favorable for flow-matching learning. Comparing pixel, SD-VAE, and DINOv2 features along four geometric axes, we find that pixel and DINOv2 share nearly identical intrinsic dimensionalities (both $\hat{d}\!\approx\!33$) yet DINOv2 exhibits $7.3\times$ higher effective rank, $35\times$ better covariance conditioning, $11.5\times$ lower excess kurtosis, and $1.7\times$ lower on-manifold interpolation error; SD-VAE latents are consistently intermediate, indicating that the advantage stems from representation-learning objectives rather than mere compression. These statistical properties render the flow-matching regression well-conditioned and remove the need for the specialized prediction heads or Riemannian transport used by prior DINOv2 diffusion methods. We propose the \emph{Representation Image Transformer} (RiT): a vanilla Diffusion Transformer trained by $x$-prediction on frozen DINOv2 features, augmented only by a dimension-aware noise schedule and joint \texttt{[CLS]}-patch modeling. On ImageNet $256{\times}256$, RiT attains FID 1.45 without guidance and 1.14 with classifier-free guidance, outperforming DiT$^\text{DH}$-XL with $19\%$ fewer parameters (676M vs.\ 839M). The resulting ODE is efficiently solvable at coarse discretizations: with classifier-free guidance, $5$ Heun steps already reach FID 2.0 and $10$ steps reach 1.25, without distillation or consistency training. Code at https://github.com/lezhang7/RiT.
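The few-step sampling claim can be illustrated with a small Heun (second-order) ODE solver driven by an x-predictor. This is a sketch under assumptions not given in the abstract: the interpolation $z_t = t\,x + (1-t)\,\epsilon$, the conversion $v = (\hat{x} - z_t)/(1-t)$, a uniform time grid, and a plain Euler update on the last step to avoid dividing by zero at $t = 1$. The `oracle` predictor is hypothetical; it replaces the trained model with the exact answer for a dataset that contains a single point. Classifier-free guidance is omitted.

```python
import random


def velocity_from_x_pred(x_hat, z, t, min_denom=1e-5):
    """Convert a clean-data prediction into a velocity (assumed convention)."""
    denom = max(1.0 - t, min_denom)
    return [(xh - zi) / denom for xh, zi in zip(x_hat, z)]


def heun_sample(x_predictor, z0, num_steps=5):
    """Integrate dz/dt = v(z, t) from t = 0 (noise) to t = 1 (data)."""
    z = list(z0)
    ts = [i / num_steps for i in range(num_steps + 1)]
    for t, t_next in zip(ts[:-1], ts[1:]):
        h = t_next - t
        v1 = velocity_from_x_pred(x_predictor(z, t), z, t)
        z_euler = [zi + h * vi for zi, vi in zip(z, v1)]
        if t_next >= 1.0:
            # Final step: keep the Euler update, since the velocity
            # conversion is singular at t = 1.
            z = z_euler
            continue
        # Heun correction: average the slopes at both ends of the step.
        v2 = velocity_from_x_pred(x_predictor(z_euler, t_next), z_euler, t_next)
        z = [zi + 0.5 * h * (a + b) for zi, a, b in zip(z, v1, v2)]
    return z


# Hypothetical oracle: for a one-point dataset the ideal x-prediction is
# that point, regardless of the noisy input or the time.
target = [0.5, -1.0, 2.0]


def oracle(z, t):
    return target


rng = random.Random(0)
z0 = [rng.gauss(0.0, 1.0) for _ in target]
sample = heun_sample(oracle, z0, num_steps=5)
```

In this toy setting the trajectories are straight lines, so five steps recover `target` up to floating-point error. A learned model on real features has curved trajectories, which is where the second-order correction matters. With this layout, each step except the last costs two predictor evaluations.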

Problem

Research questions and friction points this paper is trying to address.

flow matching

representation space

diffusion transformer

image generation

manifold structure

Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion Transformer

Flow Matching

Representation Learning