🤖 AI Summary
Existing diffusion Transformers rely on pre-trained VAE latent spaces, suffering from accumulated errors across two-stage training and decoder-induced artifacts; direct pixel-space modeling instead necessitates complex cascaded pipelines. This work proposes PixNerd, the first diffusion model to integrate neural fields for explicit pixel-space image generation, enabling single-stage, end-to-end training. PixNerd replaces the VAE decoder with a neural field that directly models image content at the pixel-block level, eliminating cascades and multi-scale architectures. On ImageNet 256×256 and 512×512, it achieves FID scores of 2.15 and 2.84, respectively. PixNerd-XXL/16 attains overall scores of 0.73 on GenEval and 80.9 on DPG, demonstrating the effectiveness and scalability of neural-field-driven, single-scale diffusion modeling.
📄 Abstract
The current success of diffusion transformers heavily depends on the compressed latent space shaped by a pre-trained variational autoencoder (VAE). However, this two-stage training paradigm inevitably introduces accumulated errors and decoding artifacts. To address these problems, researchers have returned to pixel space, at the cost of complicated cascade pipelines and increased token complexity. In contrast to these efforts, we propose to model patch-wise decoding with a neural field and present a single-scale, single-stage, efficient, end-to-end solution, coined pixel neural field diffusion (PixNerd). Thanks to the efficient neural field representation in PixNerd, we directly achieve 2.15 FID on ImageNet $256\times256$ and 2.84 FID on ImageNet $512\times512$ without any complex cascade pipeline or VAE. We also extend our PixNerd framework to text-to-image applications. Our PixNerd-XXL/16 achieves a competitive 0.73 overall score on the GenEval benchmark and an 80.9 overall score on the DPG benchmark.
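To make the patch-wise neural field decoding concrete, the sketch below shows the general idea under stated assumptions: a small coordinate-based network maps normalized (x, y) pixel positions within one patch, conditioned on that patch's token from the diffusion transformer, to RGB values. The function names, shapes, and random placeholder weights are illustrative only; the paper's actual architecture (how the field is conditioned on the token, its depth, and its activations) may differ.

```python
import numpy as np

def patch_coords(p):
    # Normalized (x, y) coordinates of a p x p pixel grid in [-1, 1].
    xs = np.linspace(-1.0, 1.0, p)
    gx, gy = np.meshgrid(xs, xs)
    return np.stack([gx.ravel(), gy.ravel()], axis=-1)  # (p*p, 2)

def neural_field_decode(token, p=16, hidden=64, rng=None):
    # token: (d,) feature vector for one image patch, e.g. a diffusion
    # transformer output token. The MLP weights here are random
    # placeholders; in a trained model they would be learned (and could
    # themselves be predicted from the token).
    rng = np.random.default_rng(0) if rng is None else rng
    coords = patch_coords(p)                            # (p*p, 2)
    d = token.shape[0]
    w1 = rng.standard_normal((2 + d, hidden)) * 0.1
    w2 = rng.standard_normal((hidden, 3)) * 0.1
    # Each pixel query = its coordinates concatenated with the patch token.
    inp = np.concatenate(
        [coords, np.broadcast_to(token, (p * p, d))], axis=-1
    )
    rgb = np.tanh(np.maximum(inp @ w1, 0.0) @ w2)       # values in [-1, 1]
    return rgb.reshape(p, p, 3)

# Decode one hypothetical 16x16 pixel patch from an 8-dim token.
patch = neural_field_decode(np.zeros(8), p=16)
print(patch.shape)  # (16, 16, 3)
```

Because the field is queried per pixel coordinate, a single-scale model can decode patches at any resolution without a separate VAE decoder stage, which is what removes the cascade.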