Simpler Diffusion (SiD2): 1.5 FID on ImageNet512 with pixel-space diffusion

📅 2024-10-25
🏛️ arXiv.org
📈 Citations: 4
Influential: 0
🤖 AI Summary
This work challenges the prevailing assumption that latent-space diffusion models are inherently superior for high-resolution image synthesis, demonstrating that end-to-end pixel-space diffusion models can achieve both high fidelity and computational efficiency. To this end, the authors introduce three key ingredients: (1) a simplified, memory-efficient architecture that drops the latent encoder and redundant skip connections; (2) a sigmoid loss weighting, with prescribed hyper-parameters, that balances reconstruction across noise levels; and (3) a scaling strategy that favors processing the image at high resolution with fewer parameters, combined with guidance intervals at sampling time. The approach achieves a state-of-the-art FID of 1.5 on ImageNet512, the first for a pixel-space model, and establishes new SOTA results on ImageNet128, ImageNet256, and Kinetics600. These results show that well-designed pixel-space diffusion models can match, and in several cases surpass, latent-space counterparts across diverse benchmarks.

📝 Abstract
Latent diffusion models have become the popular choice for scaling up diffusion models for high resolution image synthesis. Compared to pixel-space models that are trained end-to-end, latent models are perceived to be more efficient and to produce higher image quality at high resolution. Here we challenge these notions, and show that pixel-space models can be very competitive to latent models both in quality and efficiency, achieving 1.5 FID on ImageNet512 and new SOTA results on ImageNet128, ImageNet256 and Kinetics600. We present a simple recipe for scaling end-to-end pixel-space diffusion models to high resolutions. 1: Use the sigmoid loss-weighting (Kingma & Gao, 2023) with our prescribed hyper-parameters. 2: Use our simplified memory-efficient architecture with fewer skip-connections. 3: Scale the model to favor processing the image at a high resolution with fewer parameters, rather than using more parameters at a lower resolution. Combining these with guidance intervals, we obtain a family of pixel-space diffusion models we call Simpler Diffusion (SiD2).
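For context, the sigmoid loss-weighting of Kingma & Gao (2023) weights the denoising loss as w(λ) = sigmoid(b − λ), where λ is the log signal-to-noise ratio and the bias b is a hyper-parameter; SiD2 prescribes specific values, which are not reproduced here. A minimal sketch with an illustrative bias:

```python
import math

def sigmoid_loss_weight(log_snr: float, bias: float = -1.0) -> float:
    """Sigmoid weighting w(lambda) = sigmoid(bias - lambda) from
    Kingma & Gao (2023). The bias shifts which noise levels dominate
    training; the default here is illustrative, not the paper's
    prescribed hyper-parameter."""
    return 1.0 / (1.0 + math.exp(-(bias - log_snr)))

# Per-example training loss would then be weighted as
# loss = sigmoid_loss_weight(log_snr) * mse(prediction, target).
# The weight decreases monotonically with log-SNR, so noisier
# timesteps (low log-SNR) contribute more:
weights = [sigmoid_loss_weight(l) for l in (-6.0, -1.0, 4.0)]
```

Note that at λ = b the weight is exactly 0.5, and it smoothly approaches 1 for very noisy inputs and 0 for nearly clean ones.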
Problem

Research questions and friction points this paper is trying to address.

Challenging latent models' efficiency and quality dominance
Scaling pixel-space diffusion for high-resolution synthesis
Achieving SOTA results with simpler architecture
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sigmoid loss-weighting with prescribed hyper-parameters
Simplified memory-efficient architecture design
High-resolution processing with fewer parameters
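The "guidance intervals" combined with these ingredients refer to applying classifier-free guidance only within a limited band of noise levels rather than at every sampling step. A minimal sketch of that idea, with all names and interval bounds hypothetical rather than the paper's exact schedule:

```python
def guided_prediction(eps_cond: float, eps_uncond: float,
                      sigma: float, scale: float,
                      lo: float, hi: float) -> float:
    """Classifier-free guidance restricted to a noise interval:
    extrapolate from the unconditional toward the conditional
    prediction only when the noise level sigma lies in [lo, hi];
    outside the interval, use the conditional prediction as-is.
    Bounds and names here are illustrative."""
    if lo <= sigma <= hi:
        return eps_uncond + scale * (eps_cond - eps_uncond)
    return eps_cond

# Inside the interval the prediction is guided; outside it is not.
inside = guided_prediction(1.0, 0.0, sigma=0.5, scale=2.0, lo=0.1, hi=1.0)
outside = guided_prediction(1.0, 0.0, sigma=5.0, scale=2.0, lo=0.1, hi=1.0)
```

Restricting guidance to mid-range noise levels avoids the over-saturation that strong guidance causes at the extremes of the sampling trajectory.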