🤖 AI Summary
Diffusion models face a fundamental trade-off between generation quality and computational efficiency: latent diffusion models (LDMs) are efficient but suffer from information loss and non-end-to-end training, while pixel-space models avoid VAE bottlenecks yet incur prohibitive computational costs at high resolutions. This paper introduces DiP, the first efficient end-to-end pixel-space diffusion framework, which decouples generation into two stages: global structure modeling and local detail restoration. DiP employs a Diffusion Transformer over large image patches to capture long-range dependencies, and introduces a lightweight, context-aware Patch Detailer Head to recover high-frequency details. On ImageNet 256×256, DiP achieves a state-of-the-art FID of 1.90, with inference 10× faster than prior pixel-space methods at only a 0.3% parameter increase, marking the first time a pixel-space model attains inference efficiency comparable to LDMs.
📝 Abstract
Diffusion models face a fundamental trade-off between generation quality and computational efficiency. Latent Diffusion Models (LDMs) offer an efficient solution but suffer from potential information loss and non-end-to-end training. In contrast, existing pixel-space models bypass VAEs but are computationally prohibitive for high-resolution synthesis. To resolve this dilemma, we propose DiP, an efficient pixel-space diffusion framework. DiP decouples generation into a global and a local stage: a Diffusion Transformer (DiT) backbone operates on large patches for efficient global structure construction, while a co-trained lightweight Patch Detailer Head leverages contextual features to restore fine-grained local details. This synergistic design achieves computational efficiency comparable to LDMs without relying on a VAE. DiP attains up to 10× faster inference than previous pixel-space methods while increasing the total parameter count by only 0.3%, and achieves a 1.90 FID score on ImageNet 256×256.
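To see why operating on large patches makes a pixel-space DiT tractable, note that a transformer's sequence length falls quadratically with patch size, and self-attention cost scales quadratically with sequence length. The sketch below illustrates this scaling; the patch sizes are illustrative, not the paper's actual hyperparameters.

```python
def token_count(image_size: int, patch_size: int) -> int:
    """Number of transformer tokens when an image is split into
    non-overlapping square patches."""
    assert image_size % patch_size == 0
    return (image_size // patch_size) ** 2

# A pixel-space DiT on a 256x256 image with small 8x8 patches:
small_patches = token_count(256, 8)    # 1024 tokens
# The same image with large 32x32 patches (illustrative sizes):
large_patches = token_count(256, 32)   # 64 tokens

# Self-attention cost is roughly quadratic in token count, so the
# large-patch model's attention is cheaper by this factor:
attention_ratio = (small_patches / large_patches) ** 2  # 256x
```

The coarse patch grid is what limits the backbone to global structure; the lightweight Patch Detailer Head then restores the per-patch high-frequency detail that large patches cannot represent on their own.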