Visual Concept-driven Image Generation with Text-to-Image Diffusion Model

📅 2024-02-18

🏛️ arXiv.org

📈 Citations: 4

✨ Influential: 0

career value

168K/year

🤖 AI Summary

This work addresses the challenge of entangled multi-concept generation—particularly for human subjects—and poor controllability of inter-concept interactions in text-to-image diffusion models. To this end, we propose a concept-driven personalized generation framework. Methodologically, we introduce an EM-style alternating optimization mechanism that jointly learns customized text tokens and latent segmentation masks to achieve effective concept disentanglement. Our approach integrates cross-attention modeling within the U-Net architecture, DenseCRF-based post-processing, and latent-space diffusion, enabling few-shot concept injection and mask-concept co-learning. Experiments demonstrate that the framework reliably composes three or more entangled concepts with high fidelity, significantly improving concept preservation, structural consistency, and fine-grained controllability in both qualitative and quantitative evaluations. This establishes a novel paradigm for controllable, multi-concept collaborative generation in diffusion-based image synthesis.

Technology Category

Application Category

📝 Abstract

Text-to-image (TTI) diffusion models have demonstrated impressive results in generating high-resolution images of complex and imaginative scenes. Recent approaches have further extended these methods with personalization techniques that allow them to integrate user-illustrated concepts (e.g., the user him/herself) using a few sample image illustrations. However, the ability to generate images with multiple interacting concepts, such as human subjects, as well as concepts that may be entangled in one, or across multiple, image illustrations remains illusive. In this work, we propose a concept-driven TTI personalization framework that addresses these core challenges. We build on existing works that learn custom tokens for user-illustrated concepts, allowing those to interact with existing text tokens in the TTI model. However, importantly, to disentangle and better learn the concepts in question, we jointly learn (latent) segmentation masks that disentangle these concepts in user-provided image illustrations. We do so by introducing an Expectation Maximization (EM)-like optimization procedure where we alternate between learning the custom tokens and estimating (latent) masks encompassing corresponding concepts in user-supplied images. We obtain these masks based on cross-attention, from within the U-Net parameterized latent diffusion model and subsequent DenseCRF optimization. We illustrate that such joint alternating refinement leads to the learning of better tokens for concepts and, as a by-product, latent masks. We illustrate the benefits of the proposed approach qualitatively and quantitatively with several examples and use cases that can combine three or more entangled concepts.

Problem

Research questions and friction points this paper is trying to address.

Generate images with multiple interacting visual concepts

Disentangle entangled concepts in user-provided image illustrations

Improve token learning for concepts via joint mask optimization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint learning of custom tokens and segmentation masks

EM-like optimization for concept disentanglement

Cross-attention based latent mask generation

🔎 Similar Papers

Isolated Diffusion: Optimizing Multi-Concept Text-to-Image Generation Training-Freely with Isolated Diffusion Guidance