DiP: Taming Diffusion Models in Pixel Space

📅 2025-11-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion models face a fundamental trade-off between generation quality and computational efficiency: latent diffusion models (LDMs) are efficient but suffer from information loss and non-end-to-end training, while pixel-space models avoid VAE bottlenecks yet incur prohibitive computational costs at high resolution. This paper introduces DiP, the first efficient end-to-end pixel-space diffusion framework, which decouples generation into two stages: global structure modeling and local detail restoration. DiP employs a Diffusion Transformer to process large image patches and capture long-range dependencies, and introduces a lightweight, context-aware Patch Detailer Head to recover high-frequency details. On ImageNet 256×256, DiP achieves a state-of-the-art FID of 1.90, with inference 10× faster than prior pixel-space methods and only a 0.3% parameter increase, marking the first time a pixel-space model attains inference efficiency comparable to LDMs.

📝 Abstract
Diffusion models face a fundamental trade-off between generation quality and computational efficiency. Latent Diffusion Models (LDMs) offer an efficient solution but suffer from potential information loss and non-end-to-end training. In contrast, existing pixel space models bypass VAEs but are computationally prohibitive for high-resolution synthesis. To resolve this dilemma, we propose DiP, an efficient pixel space diffusion framework. DiP decouples generation into a global and a local stage: a Diffusion Transformer (DiT) backbone operates on large patches for efficient global structure construction, while a co-trained lightweight Patch Detailer Head leverages contextual features to restore fine-grained local details. This synergistic design achieves computational efficiency comparable to LDMs without relying on a VAE. DiP achieves up to 10× faster inference than previous methods while increasing the total number of parameters by only 0.3%, and attains a 1.90 FID score on ImageNet 256×256.
Problem

Research questions and friction points this paper is trying to address.

Resolving the trade-off between generation quality and computational efficiency
Overcoming information loss in latent diffusion models
Enabling high-resolution synthesis without VAE dependency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decouples generation into global and local stages
Uses a Diffusion Transformer backbone for global structure
Employs a lightweight Patch Detailer Head for local details
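The efficiency argument behind the global stage comes down to patch size: a Transformer's self-attention cost grows quadratically with token count, so tokenizing the image into large patches shrinks the sequence the DiT backbone must attend over, and the detail head then restores what coarse tokens lose. Below is a minimal NumPy sketch of standard patchify/unpatchify bookkeeping to make the token-count gap concrete; the patch sizes are illustrative, not the paper's exact configuration, and this is not the authors' code.

```python
import numpy as np

def patchify(img, p):
    """Split an (H, W, C) image into non-overlapping p x p patches,
    returning a (num_patches, p*p*C) token sequence."""
    H, W, C = img.shape
    assert H % p == 0 and W % p == 0, "image must tile evenly"
    patches = img.reshape(H // p, p, W // p, p, C)
    patches = patches.transpose(0, 2, 1, 3, 4)  # group the two grid axes
    return patches.reshape(-1, p * p * C)

def unpatchify(tokens, H, W, C, p):
    """Inverse of patchify: reassemble tokens into an (H, W, C) image."""
    patches = tokens.reshape(H // p, W // p, p, p, C)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(H, W, C)

img = np.random.rand(256, 256, 3)
coarse = patchify(img, 16)  # large patches for a global stage
fine = patchify(img, 2)     # what a naive fine-grained pixel DiT would use
print(coarse.shape)  # (256, 768)  -> 256 tokens to attend over
print(fine.shape)    # (16384, 12) -> 64x more tokens, ~4096x attention cost
```

With 16×16 patches a 256×256 image becomes only 256 tokens, versus 16,384 tokens at patch size 2, which is why pushing detail recovery into a lightweight per-patch head instead of the attention backbone pays off.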
👥 Authors
Zhennan Chen (Nanjing University)
Junwei Zhu (Tencent)
Xu Chen (Tencent Youtu Lab)
Jiangning Zhang (Tencent Youtu Lab)
Xiaobin Hu (Tencent Youtu Lab; Technische Universität München (TUM))
Hanzhen Zhao (National University of Singapore)
Chengjie Wang (Tencent Youtu Lab)
Jian Yang (Nanjing University)
Ying Tai (Nanjing University)