Enhanced Controllability of Diffusion Models via Feature Disentanglement and Realism-Enhanced Sampling Methods

📅 2023-02-28

🏛️ European Conference on Computer Vision

📈 Citations: 6

✨ Influential: 0

career value

201K/year

🤖 AI Summary

This work addresses the weak controllability and feature entanglement in diffusion models, which hinder precise image editing. Methodologically: (1) it introduces spatial content masks and flattened style embeddings to explicitly decouple layout and semantic features during training; (2) it is the first to embed inductive bias into the denoising process to implicitly separate structural and stylistic components; (3) it proposes a timestep-adaptive content/style conditioning weighting scheme and generalized composable diffusion sampling (GCDM). The key contribution lies in breaking the conditional independence assumption, enabling composable and editable generation of layout–style attributes. Experiments demonstrate significant improvements in generation fidelity and attribute controllability on image editing and cross-domain translation tasks, outperforming existing state-of-the-art methods.

📝 Abstract

As Diffusion Models have shown promising performance, a lot of efforts have been made to improve the controllability of Diffusion Models. However, how to train Diffusion Models to have the disentangled latent spaces and how to naturally incorporate the disentangled conditions during the sampling process have been underexplored. In this paper, we present a training framework for feature disentanglement of Diffusion Models (FDiff). We further propose two sampling methods that can boost the realism of our Diffusion Models and also enhance the controllability. Concisely, we train Diffusion Models conditioned on two latent features, a spatial content mask, and a flattened style embedding. We rely on the inductive bias of the denoising process of Diffusion Models to encode pose/layout information in the content feature and semantic/style information in the style feature. Regarding the sampling methods, we first generalize Composable Diffusion Models (GCDM) by breaking the conditional independence assumption to allow for some dependence between conditional inputs, which is shown to be effective in realistic generation in our experiments. Second, we propose timestep-dependent weight scheduling for content and style features to further improve the performance. We also observe better controllability of our proposed methods compared to existing methods in image manipulation and image translation.

Problem

Research questions and friction points this paper is trying to address.

Enhancing controllability of Diffusion Models via feature disentanglement

Improving realism and controllability with novel sampling methods

Training models with spatial content masks and style embeddings

Innovation

Methods, ideas, or system contributions that make the work stand out.

Feature disentanglement in Diffusion Models training

Generalized Composable Diffusion Models sampling

Timestep-dependent weight scheduling for features

🔎 Similar Papers

Diffusion Models: A Comprehensive Survey of Methods and Applications