🤖 AI Summary
Existing diffusion Transformers rely on pre-trained VAE latent spaces, suffering from accumulated errors across two-stage training and decoder-induced artifacts; direct pixel-space modeling instead necessitates complex cascaded pipelines. This work proposes PixNerd, the first diffusion model to integrate neural fields for explicit pixel-space image generation, enabling single-stage, end-to-end training. PixNerd replaces the VAE decoder with a neural field that directly models image content at the pixel-block level, eliminating cascades and multi-scale architectures. On ImageNet 256×256 and 512×512, it achieves FID scores of 2.15 and 2.84, respectively. PixNerd-XXL/16 attains overall scores of 0.73 on GenEval and 80.9 on DPG, demonstrating the effectiveness and scalability of neural-field-driven, single-scale diffusion modeling.
📄 Abstract
The current success of diffusion transformers heavily depends on the compressed latent space shaped by a pre-trained variational autoencoder (VAE). However, this two-stage training paradigm inevitably introduces accumulated errors and decoding artifacts. To address these problems, researchers have returned to pixel space, at the cost of complicated cascade pipelines and increased token complexity. In contrast to these efforts, we propose to model patch-wise decoding with a neural field and present a single-scale, single-stage, efficient, end-to-end solution, coined pixel neural field diffusion (PixNerd). Thanks to the efficient neural field representation in PixNerd, we directly achieve 2.15 FID on ImageNet $256\times256$ and 2.84 FID on ImageNet $512\times512$ without any complex cascade pipeline or VAE. We also extend our PixNerd framework to text-to-image applications. Our PixNerd-XXL/16 achieves a competitive 0.73 overall score on the GenEval benchmark and an 80.9 overall score on the DPG benchmark.
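To make the patch-wise neural field decoding concrete, the sketch below shows the general idea under stated assumptions: a small coordinate-based network maps normalized (x, y) pixel positions within one patch, conditioned on that patch's token from the diffusion transformer, to RGB values. The function names, shapes, and random placeholder weights are illustrative only; the paper's actual architecture (how the field is conditioned on the token, its depth, and its activations) may differ.

```python
import numpy as np

def patch_coords(p):
    # Normalized (x, y) coordinates of a p x p pixel grid in [-1, 1].
    xs = np.linspace(-1.0, 1.0, p)
    gx, gy = np.meshgrid(xs, xs)
    return np.stack([gx.ravel(), gy.ravel()], axis=-1)  # (p*p, 2)

def neural_field_decode(token, p=16, hidden=64, rng=None):
    # token: (d,) feature vector for one image patch, e.g. a diffusion
    # transformer output token. The MLP weights here are random
    # placeholders; in a trained model they would be learned (and could
    # themselves be predicted from the token).
    rng = np.random.default_rng(0) if rng is None else rng
    coords = patch_coords(p)                            # (p*p, 2)
    d = token.shape[0]
    w1 = rng.standard_normal((2 + d, hidden)) * 0.1
    w2 = rng.standard_normal((hidden, 3)) * 0.1
    # Each pixel query = its coordinates concatenated with the patch token.
    inp = np.concatenate(
        [coords, np.broadcast_to(token, (p * p, d))], axis=-1
    )
    rgb = np.tanh(np.maximum(inp @ w1, 0.0) @ w2)       # values in [-1, 1]
    return rgb.reshape(p, p, 3)

# Decode one hypothetical 16x16 pixel patch from an 8-dim token.
patch = neural_field_decode(np.zeros(8), p=16)
print(patch.shape)  # (16, 16, 3)
```

Because the field is queried per pixel coordinate, a single-scale model can decode patches at any resolution without a separate VAE decoder stage, which is what removes the cascade.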