🤖 AI Summary
Diffusion models for text-to-image generation often align text and image imprecisely because of two problems in their cross-attention maps: the activations of a single subject token scatter across the image, and the attention regions of different subject tokens overlap. This work proposes AI-T2I, an aggregating-and-isolating cross-attention approach built on two complementary losses: an aggregation loss that consolidates the scattered activations of each subject token, and an isolation loss that pushes the activations of different subject tokens apart. Whereas prior work focuses mainly on the overlap problem, AI-T2I targets both, and the authors report that it outperforms state-of-the-art methods on various benchmarks. They also report that it generalizes to other tasks, such as controllable layout generation and personalized generation.
📝 Abstract
Text-to-image synthesis has made significant progress, benefiting from the strong generative capabilities of diffusion models. However, these models struggle to achieve precise text-to-image alignment within cross-attention maps during the denoising process. Existing works primarily focus on the overlap of inter-subject-token activations (i.e., cross-attention scores) between different subjects, overlooking the scattering of intra-subject-token activations within the same subject. In this paper, we propose an Aggregating-and-Isolating cross-attention approach to diffusion models for Text-to-Image synthesis, dubbed AI-T2I. Technically, to address the scattering issue, we devise an aggregation loss that identifies and consolidates the scattered intra-token activations, which implicitly helps mitigate the potential overlap issue. Building on this, an isolation loss is introduced to push the inter-token activations apart, thus achieving precise text-to-image alignment. Extensive experiments on various benchmarks demonstrate the superiority of AI-T2I over state-of-the-art methods for text-to-image synthesis. Furthermore, AI-T2I generalizes well to other tasks, e.g., controllable layout generation and personalized generation.
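The abstract does not give the form of the two losses, so the following is only a minimal sketch of the general idea and not the paper's formulation. It assumes each subject token has a single non-negative 2D cross-attention map. The scattering penalty (attention-weighted spatial variance) and the overlap penalty (shared mass of two normalized maps) are illustrative stand-ins, and the function names are hypothetical.

```python
import numpy as np


def normalize(attn):
    """Scale a non-negative attention map so its entries sum to (almost) 1."""
    attn = np.asarray(attn, dtype=float)
    return attn / (attn.sum() + 1e-8)


def aggregation_loss(attn):
    """Illustrative scattering penalty (an assumption, not the paper's loss).

    Attention-weighted variance of pixel positions around the map's centroid:
    small when one token's activations are concentrated, large when scattered.
    """
    p = normalize(attn)
    h, w = p.shape
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (p * ys).sum(), (p * xs).sum()
    return float((p * ((ys - cy) ** 2 + (xs - cx) ** 2)).sum())


def isolation_loss(attn_a, attn_b):
    """Illustrative overlap penalty (an assumption, not the paper's loss).

    Shared mass of two normalized maps: 0 when the maps are disjoint,
    close to 1 when they coincide.
    """
    return float(np.minimum(normalize(attn_a), normalize(attn_b)).sum())


# Toy 8x8 maps: one compact blob, one map scattered to the four corners,
# and a second compact blob belonging to a different subject token.
compact = np.zeros((8, 8))
compact[1:3, 1:3] = 1.0
scattered = np.zeros((8, 8))
scattered[[0, 0, 7, 7], [0, 7, 0, 7]] = 1.0
other = np.zeros((8, 8))
other[5:7, 5:7] = 1.0

print(aggregation_loss(compact) < aggregation_loss(scattered))  # compact map scores lower
print(isolation_loss(compact, other))  # disjoint maps share no mass
```

In this toy setup, minimizing the first quantity pulls one token's activations together, while minimizing the second separates two tokens' regions. Using such terms as actual training or guidance losses would require writing them in an automatic-differentiation framework rather than plain NumPy.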