On Improved Conditioning Mechanisms and Pre-training Strategies for Diffusion Models

📅 2024-11-05
🏛️ arXiv.org
📈 Citations: 1
✨ Influential: 0
🤖 AI Summary
This work addresses key limitations of latent diffusion models (LDMs) in text-guided generation, multi-scale control, and training efficiency. To this end, the authors propose a conditioning mechanism that decouples semantic conditions (e.g., the text prompt) from control metadata (e.g., crop size, flip flag), and a multi-scale pretraining-transfer strategy that carries representations learned on smaller, lower-resolution datasets over to larger ones. The approach is evaluated systematically on ImageNet-1k and CC12M, showing substantial improvements in both class-conditional and text-to-image generation: FID improves by 7% (256×256) and 8% (512×512) on ImageNet-1k, and by 8% (256×256) and 23% (512×512) on CC12M, setting a new state of the art on both benchmarks. The method improves cross-resolution representation transfer and conditional modeling flexibility, offering a recipe for controllable, efficient, and reproducible diffusion modeling.

๐Ÿ“ Abstract
Large-scale training of latent diffusion models (LDMs) has enabled unprecedented quality in image generation. However, the key components of the best-performing LDM training recipes are oftentimes not available to the research community, preventing apples-to-apples comparisons and hindering the validation of progress in the field. In this work, we perform an in-depth study of LDM training recipes, focusing on model performance and training efficiency. To ensure apples-to-apples comparisons, we re-implement five previously published models with their corresponding recipes. Through our study, we explore the effects of (i) the mechanisms used to condition the generative model on semantic information (e.g., text prompt) and control metadata (e.g., crop size, random flip flag, etc.) on model performance, and (ii) the transfer of representations learned on smaller and lower-resolution datasets to larger ones on training efficiency and model performance. We then propose a novel conditioning mechanism that disentangles semantic and control metadata conditionings and sets a new state of the art in class-conditional generation on the ImageNet-1k dataset, with FID improvements of 7% at 256 and 8% at 512 resolution, as well as in text-to-image generation on the CC12M dataset, with FID improvements of 8% at 256 and 23% at 512 resolution.
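The abstract does not spell out how the disentangled conditioning mechanism works. The toy NumPy sketch below illustrates the general idea under stated assumptions: semantic conditions and control metadata are embedded by separate networks and kept in disjoint channels of the conditioning vector, in contrast to a baseline that sums all condition embeddings into one entangled vector. All class names, dimensions, and the concatenation design here are hypothetical illustrations, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def _mlp_params(d_in, d_hidden, d_out):
    # Random weights for a small 2-layer MLP (illustration only).
    return (rng.normal(0, 0.1, (d_in, d_hidden)), np.zeros(d_hidden),
            rng.normal(0, 0.1, (d_hidden, d_out)), np.zeros(d_out))

def _mlp(x, params):
    w1, b1, w2, b2 = params
    h = np.maximum(x @ w1 + b1, 0.0)  # ReLU hidden layer
    return h @ w2 + b2

class EntangledConditioner:
    # Baseline pattern: embed every condition and *sum* the embeddings,
    # so semantic and control signals are mixed in the same channels.
    def __init__(self, sem_dim, ctrl_dim, hidden=64, out_dim=64):
        self.sem_net = _mlp_params(sem_dim, hidden, out_dim)
        self.ctrl_net = _mlp_params(ctrl_dim, hidden, out_dim)

    def __call__(self, sem, ctrl):
        return _mlp(sem, self.sem_net) + _mlp(ctrl, self.ctrl_net)

class DisentangledConditioner:
    # Hypothetical "disentangled" variant: separate MLP pathways whose
    # outputs are *concatenated*, so semantic and control-metadata
    # signals occupy disjoint channels of the conditioning vector.
    def __init__(self, sem_dim, ctrl_dim, hidden=64, half_dim=32):
        self.sem_net = _mlp_params(sem_dim, hidden, half_dim)
        self.ctrl_net = _mlp_params(ctrl_dim, hidden, half_dim)

    def __call__(self, sem, ctrl):
        return np.concatenate([_mlp(sem, self.sem_net),
                               _mlp(ctrl, self.ctrl_net)])
```

With the concatenated design, changing a control input (e.g., the flip flag) perturbs only the control half of the conditioning vector, leaving the semantic channels untouched; in a DiT-style backbone this vector could then drive, say, adaLN modulation, though the paper's injection point is not specified here.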
Problem

Research questions and friction points this paper is trying to address.

Large-scale training
Latent Diffusion Models
Image generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Image Quality Enhancement
Latent Diffusion Models
Training Efficiency
Tariq Berrada Ifriqi
FAIR at Meta, Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, France
Pietro Astolfi
FAIR at Meta
Melissa Hall
Research Engineer, Facebook (Algorithmic Fairness, Machine Learning)
Reyhane Askari-Hemmat
FAIR at Meta
Yohann Benchetrit
FAIR at Meta
Marton Havasi
FAIR at Meta
Matthew J. Muckley
FAIR at Meta
Karteek Alahari
Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, France
Adriana Romero-Soriano
FAIR at Meta (Deep Learning, Machine Learning, AI)
Jakob Verbeek
FAIR at Meta (Machine Learning, Computer Vision, Artificial Intelligence)
M. Drozdzal
FAIR at Meta