PixNerd: Pixel Neural Field Diffusion

📅 2025-07-31
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing diffusion Transformers rely on pre-trained VAE latent spaces, suffering from accumulated errors across two-stage training and decoder-induced artifacts; direct pixel-space modeling, in turn, has required complex cascaded pipelines. This work proposes PixNerd, the first diffusion model to integrate neural fields for explicit pixel-space image generation, enabling single-stage, end-to-end training. PixNerd replaces the VAE decoder with a neural field that directly models image content at the patch level, eliminating cascades and multi-scale architectures. On ImageNet 256×256 and 512×512, it achieves FID scores of 2.15 and 2.84, respectively. PixNerd-XXL/16 attains overall scores of 0.73 on GenEval and 80.9 on DPG, demonstrating the effectiveness and scalability of neural-field-driven, single-scale diffusion modeling.
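The patch-wise neural-field decoding described above can be illustrated with a minimal sketch. The idea, under assumptions not spelled out in this summary, is that the diffusion transformer emits a small parameter vector per image patch, and that vector is interpreted as the weights of a tiny coordinate MLP mapping normalized (x, y) positions inside the patch to RGB values. The parameter layout, hidden width, and activation here are hypothetical choices for illustration, not the paper's actual architecture.

```python
import numpy as np

def decode_patch(weights, patch_size=16, hidden=8):
    """Decode one patch with a tiny per-patch neural field (coordinate MLP).

    `weights` stands in for the flat parameter vector a diffusion
    transformer might predict for this patch. Hypothetical layout:
    [W1 (2*hidden), b1 (hidden), W2 (hidden*3), b2 (3)].
    """
    # Unpack the flat vector into a 2-layer MLP: (x, y) -> hidden -> RGB.
    n1 = 2 * hidden
    W1 = weights[:n1].reshape(2, hidden)
    b1 = weights[n1:n1 + hidden]
    n2 = n1 + hidden
    W2 = weights[n2:n2 + hidden * 3].reshape(hidden, 3)
    b2 = weights[n2 + hidden * 3:]

    # Normalized pixel coordinates within the patch, in [-1, 1].
    ys, xs = np.meshgrid(np.linspace(-1, 1, patch_size),
                         np.linspace(-1, 1, patch_size), indexing="ij")
    coords = np.stack([xs, ys], axis=-1).reshape(-1, 2)  # (P*P, 2)

    h = np.tanh(coords @ W1 + b1)        # hidden features per pixel
    rgb = h @ W2 + b2                    # (P*P, 3) raw RGB values
    return rgb.reshape(patch_size, patch_size, 3)

# Hypothetical usage: one predicted weight vector per 16x16 patch.
hidden = 8
n_params = 2 * hidden + hidden + hidden * 3 + 3  # 51 parameters
rng = np.random.default_rng(0)
patch = decode_patch(rng.normal(size=n_params))
print(patch.shape)  # (16, 16, 3)
```

Because each patch is decoded independently from its own coordinate MLP, no separate VAE decoder or multi-scale cascade is needed; the decoder is part of the single end-to-end model.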

πŸ“ Abstract
The current success of diffusion transformers heavily depends on the compressed latent space shaped by a pre-trained variational autoencoder (VAE). However, this two-stage training paradigm inevitably introduces accumulated errors and decoding artifacts. To address these problems, researchers have returned to pixel space at the cost of complicated cascade pipelines and increased token complexity. In contrast to their efforts, we propose to model patch-wise decoding with a neural field and present a single-scale, single-stage, efficient, end-to-end solution, coined pixel neural field diffusion (PixNerd). Thanks to the efficient neural field representation in PixNerd, we directly achieved 2.15 FID on ImageNet 256×256 and 2.84 FID on ImageNet 512×512 without any complex cascade pipeline or VAE. We also extend our PixNerd framework to text-to-image applications. Our PixNerd-XXL/16 achieved a competitive 0.73 overall score on the GenEval benchmark and an 80.9 overall score on the DPG benchmark.
Problem

Research questions and friction points this paper is trying to address.

Eliminate errors from two-stage VAE-based diffusion training
Reduce complexity of pixel-space cascade pipelines
Improve efficiency in high-resolution image generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Neural field patch-wise decoding
Single-scale single-stage solution
End-to-end efficient representation