Self-Cross Diffusion Guidance for Text-to-Image Synthesis of Similar Subjects

๐Ÿ“… 2024-11-28
๐Ÿ›๏ธ arXiv.org
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿค– AI Summary
Diffusion models frequently suffer from subject confusion in multi-subject image generation, especially when subjects exhibit similar appearances. To address this, we propose Self-Cross Diffusion Guidanceโ€”a training-free mechanism that enforces spatial alignment between cross-attention maps and aggregated self-attention maps, enabling precise, holistic separation of subjects (beyond discriminative local regions). This is the first method to jointly leverage self- and cross-attention guidance for confusion suppression. We introduce a novel benchmark tailored for visually similar subjects and a GPT-4oโ€“driven automatic evaluation protocol. Our approach is plug-and-play, compatible with any text-to-image diffusion model (e.g., Stable Diffusion). Extensive quantitative and qualitative experiments demonstrate substantial reductions in subject confusion rates compared to state-of-the-art methods, with strong robustness and generalization across diverse prompts and model backbones.

๐Ÿ“ Abstract
Diffusion models have achieved unprecedented fidelity and diversity for synthesizing images, videos, 3D assets, etc. However, subject mixing is a known and unresolved issue for diffusion-based image synthesis, particularly when synthesizing multiple similar-looking subjects. We propose Self-Cross diffusion guidance to penalize the overlap between cross-attention maps and aggregated self-attention maps. Compared to previous methods based on self-attention or cross-attention alone, our self-cross guidance is more effective at eliminating subject mixing. Moreover, our guidance addresses mixing for all relevant patches of a subject, beyond the most discriminative one, e.g., the beak of a bird. We aggregate the self-attention maps of automatically selected patches for a subject to form a region that the whole subject attends to. Our method is training-free and can boost the performance of any transformer-based diffusion model, such as Stable Diffusion, for synthesizing similar subjects. We also release a more challenging benchmark with many text prompts of similar-looking subjects and utilize GPT-4o for automatic and reliable evaluation. Extensive qualitative and quantitative results demonstrate the effectiveness of our Self-Cross guidance.
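The core mechanism described in the abstract can be sketched as a small numpy routine. This is a minimal illustration, not the authors' implementation: the function name, the top-k patch selection, and the use of an element-wise minimum as the overlap penalty are assumptions; the paper only specifies that self-attention maps of automatically selected patches are aggregated per subject and that overlap between one subject's aggregated self-attention region and another subject's cross-attention map is penalized.

```python
import numpy as np

def self_cross_guidance_loss(cross_attn, self_attn, top_k=16):
    """Hypothetical sketch of a self-cross overlap penalty.

    cross_attn: (S, P) cross-attention of S subject tokens over P image patches
    self_attn:  (P, P) self-attention among image patches

    For each subject, aggregate the self-attention rows of its top-k
    most-attended patches into a region map, then penalize the overlap
    between subject i's region and subject j's cross-attention (i != j).
    """
    S, P = cross_attn.shape
    # Normalize cross-attention maps to distributions over patches.
    cross = cross_attn / (cross_attn.sum(axis=1, keepdims=True) + 1e-8)

    # Aggregate self-attention over each subject's most-attended patches.
    agg = np.zeros_like(cross)
    for i in range(S):
        idx = np.argsort(cross[i])[-top_k:]       # top-k patches for subject i
        region = self_attn[idx].mean(axis=0)      # aggregated self-attention map
        agg[i] = region / (region.sum() + 1e-8)

    # Overlap penalty between subject i's region and subject j's cross map,
    # here measured as the element-wise minimum of the two distributions.
    loss = 0.0
    for i in range(S):
        for j in range(S):
            if i != j:
                loss += float(np.minimum(agg[i], cross[j]).sum())
    return loss
```

In a guidance setting this loss would be differentiated with respect to the noisy latent at each denoising step, nudging the sample so that similar subjects occupy disjoint attention regions.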
Problem

Research questions and friction points this paper is trying to address.

Resolves subject mixing in diffusion-based image synthesis
Penalizes overlap between cross-attention and self-attention maps
Improves synthesis of multiple similar-looking subjects
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-Cross Diffusion Guidance penalizes attention overlap
Aggregates self-attention maps for whole subject regions
Training-free method that boosts any transformer-based diffusion model (e.g., Stable Diffusion)
Weimin Qiu
University of California Merced
Jieke Wang
University of California Merced
Meng Tang
Assistant Professor, University of California, Merced
computer vision · machine learning · optimization