🤖 AI Summary
Problem: Conventional diffusion models train their networks to predict noise rather than clean images. Under the manifold hypothesis, natural data lie on a low-dimensional manifold while noise occupies the full ambient space, so noise prediction forces the network to model an unstructured high-dimensional target, undermining modeling efficiency in high-dimensional spaces.
Method: We challenge the canonical “denoising” paradigm and propose Just image Transformers (JiT), a generative framework that directly predicts clean images. JiT eliminates tokenization, pretraining, auxiliary losses, and intermediate representations; instead, it employs large-patch, pixel-level Transformers operating natively in raw image space to model the diffusion process end-to-end.
Contribution/Results: By adhering to the manifold assumption, JiT improves representational efficiency and scalability. Experiments demonstrate competitive performance on ImageNet at both 256×256 and 512×512 resolutions. Notably, JiT with large patches significantly outperforms classical noise-prediction baselines, which can fail catastrophically at these dimensionalities, offering a principled return of diffusion modeling to its denoising origins.
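The contrast between the two training targets can be made concrete. Below is a minimal NumPy sketch (function names are hypothetical, not from the paper's code) of a linear-interpolation forward process with the two possible regression targets: the clean image $x_0$, which lies near a low-dimensional manifold, versus the noise $\epsilon$, which has no such structure.

```python
import numpy as np

rng = np.random.default_rng(0)

def noise_data(x0, t, eps):
    """Linear-interpolation forward process: x_t = (1 - t) * x0 + t * eps.

    x0 is structured (near a low-dimensional manifold); eps is
    full-dimensional Gaussian noise, so x_t loses structure as t -> 1.
    """
    return (1.0 - t) * x0 + t * eps

def x_prediction_loss(model, x0, t, eps):
    """Clean-data prediction: the network regresses x0 directly."""
    x_t = noise_data(x0, t, eps)
    return np.mean((model(x_t, t) - x0) ** 2)

def eps_prediction_loss(model, x0, t, eps):
    """Conventional target: the network must reproduce the noise eps,
    a quantity with no low-dimensional structure."""
    x_t = noise_data(x0, t, eps)
    return np.mean((model(x_t, t) - eps) ** 2)

# Toy check: an oracle that knows x0 drives the x-prediction loss to zero,
# while the same output leaves the eps-prediction loss large.
x0 = rng.normal(size=(4, 8))   # stand-in for a batch of clean images
eps = rng.normal(size=(4, 8))
oracle = lambda x_t, t: x0
print(x_prediction_loss(oracle, x0, 0.7, eps))  # 0.0
```

The paper's argument is that only the first target asks an under-capacity network to produce something low-dimensional; the sampler can still recover any noised quantity from the predicted $x_0$ afterwards.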
📝 Abstract
Today's denoising diffusion models do not "denoise" in the classical sense, i.e., they do not directly predict clean images. Rather, the neural networks predict noise or a noised quantity. In this paper, we suggest that predicting clean data and predicting noised quantities are fundamentally different. According to the manifold assumption, natural data should lie on a low-dimensional manifold, whereas noised quantities do not. With this assumption, we advocate for models that directly predict clean data, which allows apparently under-capacity networks to operate effectively in very high-dimensional spaces. We show that simple, large-patch Transformers on pixels can be strong generative models: using no tokenizer, no pre-training, and no extra loss. Our approach is conceptually nothing more than "$\textbf{Just image Transformers}$", or $\textbf{JiT}$, as we call it. We report competitive results using JiT with large patch sizes of 16 and 32 on ImageNet at resolutions of 256 and 512, where predicting high-dimensional noised quantities can fail catastrophically. With networks operating on the manifold, our research goes back to basics and pursues a self-contained paradigm for Transformer-based diffusion on raw natural data.
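To make "large-patch Transformers on pixels, no tokenizer" concrete, here is a sketch of the standard ViT-style patchification applied to raw pixels (the `patchify` helper is illustrative, not taken from the paper's code). At 256×256 with patch size 32, each token is already a 3072-dimensional vector of raw pixel values that the network must predict.

```python
import numpy as np

def patchify(img, p):
    """Split an (H, W, C) image into non-overlapping p x p patches,
    each flattened to a vector. These raw-pixel patches are the
    Transformer's input tokens; no learned tokenizer is involved.
    """
    H, W, C = img.shape
    assert H % p == 0 and W % p == 0, "patch size must divide the image"
    x = img.reshape(H // p, p, W // p, p, C)
    x = x.transpose(0, 2, 1, 3, 4)        # (H/p, W/p, p, p, C)
    return x.reshape(-1, p * p * C)       # (num_tokens, token_dim)

img = np.zeros((256, 256, 3))             # stand-in for a raw RGB image
tokens = patchify(img, 32)
print(tokens.shape)  # (64, 3072): 8x8 grid of 32*32*3-dim tokens
```

With patch size 16 at the same resolution, the grid is 16×16 = 256 tokens of dimension 768; either way the model works natively in pixel space end to end.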