SCP-Diff: Spatial-Categorical Joint Prior for Diffusion Based Semantic Image Synthesis

📅 2024-03-14
🏛️ European Conference on Computer Vision
📈 Citations: 7
✨ Influential: 0
🤖 AI Summary
Existing diffusion models for semantic image synthesis (SIS) suffer from two key limitations: structural artifacts and misalignment between semantic masks and generated content. To address these, this paper proposes SCP-Diff, which introduces a novel spatial-categorical joint prior (SCP) that couples spatial layout with category distribution, thereby reducing the mismatch between the noised training data distribution and the standard normal prior used at inference. Built on the ControlNet architecture within the latent diffusion framework, the method derives spatial, categorical, and joint spatial-categorical noise priors for inference. It reports state-of-the-art results on three benchmark datasets (Cityscapes, ADE20K, and COCO-Stuff), including an FID as low as 10.53 on Cityscapes.

📝 Abstract
Semantic image synthesis (SIS) shows good promise for sensor simulation. However, current best practices in this field, based on GANs, have not yet reached the desired level of quality. As latent diffusion models make significant strides in image generation, we are prompted to evaluate ControlNet, a notable method for its dense control capabilities. Our investigation uncovered two primary issues with its results: the presence of weird sub-structures within large semantic areas and the misalignment of content with the semantic mask. Through empirical study, we pinpointed the cause of these problems as a mismatch between the noised training data distribution and the standard normal prior applied at the inference stage. To address this challenge, we developed specific noise priors for SIS, encompassing spatial, categorical, and a novel spatial-categorical joint prior for inference. This approach, which we have named SCP-Diff, has set new state-of-the-art results in SIS on Cityscapes, ADE20K and COCO-Stuff, yielding an FID as low as 10.53 on Cityscapes. The code and models can be accessed via the project page.
Problem

Research questions and friction points this paper is trying to address.

Addresses weird sub-structures in large semantic areas
Resolves content misalignment with semantic masks
Corrects mismatch between training noise and inference prior
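
The mismatch named in the last item can be made concrete with a short derivation. This is a sketch in standard DDPM-style notation, which is an assumption here (the page itself gives no formulas): x_0 is a clean latent, m is its semantic mask, and \bar{\alpha}_T is the cumulative signal coefficient at the final training step T.

```latex
% Forward-noised training sample at the last step T (assumed DDPM notation)
x_T = \sqrt{\bar{\alpha}_T}\, x_0 + \sqrt{1 - \bar{\alpha}_T}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)

% Moments of the noised training data, conditioned on the mask m
\mathbb{E}[x_T \mid m] = \sqrt{\bar{\alpha}_T}\, \mathbb{E}[x_0 \mid m],
\qquad
\operatorname{Cov}[x_T \mid m] = \bar{\alpha}_T \operatorname{Cov}[x_0 \mid m] + (1 - \bar{\alpha}_T)\, I
```

These moments equal those of the standard normal prior only when \bar{\alpha}_T = 0, or when the data itself already has zero mean and identity covariance. Otherwise, a sampler started from \mathcal{N}(0, I) begins from inputs the denoiser did not see during training.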
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatial-categorical joint prior for diffusion
Specific noise priors for semantic synthesis
Novel inference approach addressing distribution mismatch
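
One way to picture a spatial-categorical joint prior is as per-class, per-location Gaussian statistics of training latents, which then replace the standard normal draw at the start of sampling. The sketch below is illustrative only and is not the authors' implementation: the function names `joint_prior_stats` and `sample_from_joint_prior`, the diagonal-Gaussian form, and the fallback to N(0, 1) for unseen class/location pairs are all assumptions.

```python
import numpy as np


def joint_prior_stats(latents, masks, num_classes):
    """Estimate per-class, per-location mean and variance of latents.

    latents: float array (N, C, H, W), e.g. noised training latents (assumed input).
    masks:   int array (N, H, W) of class ids at latent resolution.
    Returns (mean, var), each of shape (num_classes, C, H, W).
    """
    class_ids = np.arange(num_classes)[None, :, None, None]
    onehot = (masks[:, None] == class_ids).astype(latents.dtype)  # (N, K, H, W)
    count = onehot.sum(axis=0)                                    # (K, H, W)

    # First and second moments accumulated over the samples of each class.
    s1 = np.einsum("nkhw,nchw->kchw", onehot, latents)
    s2 = np.einsum("nkhw,nchw->kchw", onehot, latents ** 2)
    denom = np.maximum(count, 1.0)[:, None]
    mean = s1 / denom
    var = np.maximum(s2 / denom - mean ** 2, 0.0)

    # Assumption: where a class never occurs at a location, fall back to N(0, 1).
    unseen = (count == 0)[:, None]
    mean = np.where(unseen, 0.0, mean)
    var = np.where(unseen, 1.0, var)
    return mean, var


def sample_from_joint_prior(mask, mean, var, rng):
    """Draw an initial latent (C, H, W) whose statistics follow the mask."""
    h, w = mask.shape
    rows, cols = np.indices((h, w))
    # Advanced indexing around the channel slice yields shape (H, W, C).
    mu = mean[mask, :, rows, cols].transpose(2, 0, 1)
    sigma = np.sqrt(var[mask, :, rows, cols]).transpose(2, 0, 1)
    return mu + sigma * rng.standard_normal(mu.shape)
```

Under the same assumptions, a spatial-only prior would average over samples at each location and ignore class ids, and a categorical-only prior would pool every location belonging to a class. The joint version keeps both indices.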
Authors

Huan-ang Gao
Ph.D. student, Tsinghua University
Agent, Vision & Robotics
Mingju Gao
Unknown affiliation
Computer Vision, Robotics
Jiaju Li
Institute for AI Industry Research (AIR), Tsinghua University and University of Chinese Academy of Sciences
Wenyi Li
Institute for AI Industry Research (AIR), Tsinghua University
Rong Zhi
Mercedes-Benz Group China Ltd.
Hao Tang
Carnegie Mellon University
Hao Zhao
Institute for AI Industry Research (AIR), Tsinghua University