🤖 AI Summary
This work asks whether a pretrained representation space offers a distribution more favorable for flow-matching generation than pixel or SD-VAE latent space, while keeping the architecture simple. Working on frozen DINOv2 features, the authors train a vanilla Diffusion Transformer, the Representation Image Transformer (RiT), with the standard x-prediction objective, removing the need for the specialized prediction heads or Riemannian transport used by prior DINOv2 diffusion methods. The contributions are a geometric comparison showing that DINOv2 features have similar intrinsic dimensionality to pixels but higher effective rank, better covariance conditioning, lower excess kurtosis, and lower on-manifold interpolation error, together with a dimension-aware noise schedule and joint [CLS]-patch modeling. On ImageNet 256×256, RiT reaches an FID of 1.45 without guidance and 1.14 with classifier-free guidance, outperforming DiT^DH-XL with 19% fewer parameters (676M vs. 839M). With classifier-free guidance, 5 Heun steps reach an FID of 2.0 and 10 steps reach 1.25, without distillation or consistency training.
📝 Abstract
Flow matching with $x$-prediction -- regressing the clean data point rather than the ambient velocity -- is known to exploit low-dimensional manifold structure effectively in pixel space \cite{li2025back}. We ask whether a pretrained representation space, while containing a low-dimensional data manifold of comparable intrinsic dimensionality, offers a distribution more favorable for flow-matching learning. Comparing pixel, SD-VAE, and DINOv2 features along four geometric axes, we find that pixel and DINOv2 share nearly identical intrinsic dimensionalities (both $\hat{d}\!\approx\!33$) yet DINOv2 exhibits $7.3\times$ higher effective rank, $35\times$ better covariance conditioning, $11.5\times$ lower excess kurtosis, and $1.7\times$ lower on-manifold interpolation error; SD-VAE latents are consistently intermediate, indicating that the advantage stems from representation-learning objectives rather than mere compression. These statistical properties render the flow-matching regression well-conditioned and remove the need for the specialized prediction heads or Riemannian transport used by prior DINOv2 diffusion methods. We propose the \emph{Representation Image Transformer} (RiT): a vanilla Diffusion Transformer trained by $x$-prediction on frozen DINOv2 features, augmented only by a dimension-aware noise schedule and joint \texttt{[CLS]}-patch modeling. On ImageNet $256{\times}256$, RiT attains FID 1.45 without guidance and 1.14 with classifier-free guidance, outperforming DiT$^\text{DH}$-XL with $19\%$ fewer parameters (676M vs.\ 839M). The resulting ODE is efficiently solvable at coarse discretizations: with classifier-free guidance, $5$ Heun steps already reach FID 2.0 and $10$ steps reach 1.25, without distillation or consistency training. Code at https://github.com/lezhang7/RiT.
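As an illustration of how an x-prediction model can drive a few-step Heun sampler, here is a minimal NumPy sketch; it is not the authors' implementation. It assumes the linear path z_t = t·x + (1−t)·ε with t running from 0 (noise) to 1 (data), a convention the abstract does not state. The function names (`interpolate`, `x_prediction_loss`, `velocity_from_x_pred`, `heun_sample`) and the `x_predictor` callable are hypothetical, and the sketch omits the dimension-aware noise schedule, classifier-free guidance, and the decoder from DINOv2 features back to pixels.

```python
import numpy as np

def interpolate(x, noise, t):
    """Assumed linear path: z_t = t * x + (1 - t) * noise (t=0 noise, t=1 data)."""
    return t * x + (1.0 - t) * noise

def x_prediction_loss(x_pred, x):
    """Plain mean-squared error between predicted and clean features.

    One possible objective; the abstract does not specify the loss weighting.
    """
    return float(np.mean((x_pred - x) ** 2))

def velocity_from_x_pred(x_pred, z_t, t, min_denom=1e-3):
    """Convert a clean-data prediction into the ODE velocity dz/dt.

    Under the assumed path, v = x - noise = (x - z_t) / (1 - t).
    The denominator is clamped to avoid division by zero near t = 1.
    """
    return (x_pred - z_t) / max(1.0 - t, min_denom)

def heun_sample(x_predictor, z0, num_steps=5):
    """Integrate dz/dt = v(z, t) from t=0 to t=1 with Heun's method.

    x_predictor(z, t) is a hypothetical callable returning the predicted
    clean sample. The final step falls back to Euler because the velocity
    is undefined at t = 1.
    """
    ts = np.linspace(0.0, 1.0, num_steps + 1)
    z = np.asarray(z0, dtype=float)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        dt = t_next - t_cur
        v1 = velocity_from_x_pred(x_predictor(z, t_cur), z, t_cur)
        z_euler = z + dt * v1                      # Euler predictor
        if t_next < 1.0:
            v2 = velocity_from_x_pred(x_predictor(z_euler, t_next), z_euler, t_next)
            z = z + 0.5 * dt * (v1 + v2)           # trapezoidal corrector
        else:
            z = z_euler
    return z
```

As a sanity check, if `x_predictor` always returns the same target vector (a point-mass data distribution), the velocity is constant along the trajectory and `heun_sample` lands on that target for any number of steps.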