🤖 AI Summary
Problem: Conventional diffusion models train their networks to predict noise rather than clean images. Under the manifold hypothesis, natural data lie on a low-dimensional manifold while noise occupies the full ambient space, so noise prediction forces the network to model an unstructured high-dimensional target, undermining modeling efficiency in high-dimensional spaces.
Method: We challenge the canonical “denoising” paradigm and propose Just image Transformers (JiT), a generative framework that directly predicts clean images. JiT eliminates tokenization, pretraining, auxiliary losses, and intermediate representations; instead, it employs large-patch, pixel-level Transformers operating natively in raw image space to model the diffusion process end-to-end.
Contribution/Results: By adhering to the manifold assumption, JiT improves representational efficiency and scalability. Experiments demonstrate competitive performance on ImageNet at both 256×256 and 512×512 resolutions. Notably, JiT with large patches significantly outperforms classical noise-prediction baselines, which can fail catastrophically at these dimensionalities, offering a principled return of diffusion modeling to its denoising origins.
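The contrast between the two training targets can be made concrete. Below is a minimal NumPy sketch (function names are hypothetical, not from the paper's code) of a linear-interpolation forward process with the two possible regression targets: the clean image $x_0$, which lies near a low-dimensional manifold, versus the noise $\epsilon$, which has no such structure.

```python
import numpy as np

rng = np.random.default_rng(0)

def noise_data(x0, t, eps):
    """Linear-interpolation forward process: x_t = (1 - t) * x0 + t * eps.

    x0 is structured (near a low-dimensional manifold); eps is
    full-dimensional Gaussian noise, so x_t loses structure as t -> 1.
    """
    return (1.0 - t) * x0 + t * eps

def x_prediction_loss(model, x0, t, eps):
    """Clean-data prediction: the network regresses x0 directly."""
    x_t = noise_data(x0, t, eps)
    return np.mean((model(x_t, t) - x0) ** 2)

def eps_prediction_loss(model, x0, t, eps):
    """Conventional target: the network must reproduce the noise eps,
    a quantity with no low-dimensional structure."""
    x_t = noise_data(x0, t, eps)
    return np.mean((model(x_t, t) - eps) ** 2)

# Toy check: an oracle that knows x0 drives the x-prediction loss to zero,
# while the same output leaves the eps-prediction loss large.
x0 = rng.normal(size=(4, 8))   # stand-in for a batch of clean images
eps = rng.normal(size=(4, 8))
oracle = lambda x_t, t: x0
print(x_prediction_loss(oracle, x0, 0.7, eps))  # 0.0
```

The paper's argument is that only the first target asks an under-capacity network to produce something low-dimensional; the sampler can still recover any noised quantity from the predicted $x_0$ afterwards.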
📝 Abstract
Today's denoising diffusion models do not "denoise" in the classical sense, i.e., they do not directly predict clean images. Rather, the neural networks predict noise or a noised quantity. In this paper, we suggest that predicting clean data and predicting noised quantities are fundamentally different. According to the manifold assumption, natural data should lie on a low-dimensional manifold, whereas noised quantities do not. With this assumption, we advocate for models that directly predict clean data, which allows apparently under-capacity networks to operate effectively in very high-dimensional spaces. We show that simple, large-patch Transformers on pixels can be strong generative models: using no tokenizer, no pre-training, and no extra loss. Our approach is conceptually nothing more than "$\textbf{Just image Transformers}$", or $\textbf{JiT}$, as we call it. We report competitive results using JiT with large patch sizes of 16 and 32 on ImageNet at resolutions of 256 and 512, where predicting high-dimensional noised quantities can fail catastrophically. With networks operating on the manifold, our research goes back to basics and pursues a self-contained paradigm for Transformer-based diffusion on raw natural data.
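To make "large-patch Transformers on pixels, no tokenizer" concrete, here is a sketch of the standard ViT-style patchification applied to raw pixels (the `patchify` helper is illustrative, not taken from the paper's code). At 256×256 with patch size 32, each token is already a 3072-dimensional vector of raw pixel values that the network must predict.

```python
import numpy as np

def patchify(img, p):
    """Split an (H, W, C) image into non-overlapping p x p patches,
    each flattened to a vector. These raw-pixel patches are the
    Transformer's input tokens; no learned tokenizer is involved.
    """
    H, W, C = img.shape
    assert H % p == 0 and W % p == 0, "patch size must divide the image"
    x = img.reshape(H // p, p, W // p, p, C)
    x = x.transpose(0, 2, 1, 3, 4)        # (H/p, W/p, p, p, C)
    return x.reshape(-1, p * p * C)       # (num_tokens, token_dim)

img = np.zeros((256, 256, 3))             # stand-in for a raw RGB image
tokens = patchify(img, 32)
print(tokens.shape)  # (64, 3072): 8x8 grid of 32*32*3-dim tokens
```

With patch size 16 at the same resolution, the grid is 16×16 = 256 tokens of dimension 768; either way the model works natively in pixel space end to end.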