Cross-Attention Head Position Patterns Can Align with Human Visual Concepts in Text-to-Image Generative Models

๐Ÿ“… 2024-12-03
๐Ÿ›๏ธ arXiv.org
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This work addresses the limited interpretability of cross-attention mechanisms in text-to-image diffusion models and their weak alignment with human visual concepts. The authors propose Head Relevance Vectors (HRVs): learnable, semantically aligned importance scores over the cross-attention heads. By weakening attention heads in order of relevance, they identify the heads critical to each visual concept and develop HRV-guided concept strengthening and concept adjusting strategies, enabling fine-grained semantic control at the level of individual attention heads. The method reduces misgeneration of polysemous words (e.g., “apple” erroneously rendered as “orange”), successfully edits five challenging attributes (e.g., material, pose, illumination), and mitigates catastrophic neglect in multi-concept generation. Overall, the work establishes an interpretable, intervention-friendly paradigm for controllable image generation.

๐Ÿ“ Abstract
Recent text-to-image diffusion models leverage cross-attention layers, which have been effectively utilized to enhance a range of visual generative tasks. However, our understanding of cross-attention layers remains somewhat limited. In this study, we introduce a mechanistic interpretability approach for diffusion models by constructing Head Relevance Vectors (HRVs) that align with human-specified visual concepts. An HRV for a given visual concept has a length equal to the total number of cross-attention heads, with each element representing the importance of the corresponding head for the given visual concept. To validate HRVs as interpretable features, we develop an ordered weakening analysis that demonstrates their effectiveness. Furthermore, we propose concept strengthening and concept adjusting methods and apply them to enhance three visual generative tasks. Our results show that HRVs can reduce misinterpretations of polysemous words in image generation, successfully modify five challenging attributes in image editing, and mitigate catastrophic neglect in multi-concept generation. Overall, our work provides an advancement in understanding cross-attention layers and introduces new approaches for fine-controlling these layers at the head level.
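Conceptually, an HRV is a weight vector whose length equals the total number of cross-attention heads, and the ordered weakening analysis disables heads from least to most relevant. A minimal NumPy sketch of both ideas is below; the function names `apply_hrv` and `ordered_weakening` and the array shapes are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def apply_hrv(head_outputs, hrv):
    """Scale each cross-attention head's output by its HRV weight.

    head_outputs: (H, tokens, dim) array, one slice per attention head.
    hrv: (H,) importance scores of the heads for one visual concept.
    """
    return head_outputs * hrv[:, None, None]

def ordered_weakening(hrv, k):
    """Return a 0/1 mask keeping only the k most relevant heads.

    Sweeping k from H down to 0 weakens heads in order of relevance,
    mirroring the paper's ordered weakening analysis.
    """
    keep = np.argsort(hrv)[::-1][:k]   # indices of the k largest scores
    mask = np.zeros_like(hrv)
    mask[keep] = 1.0
    return mask
```

In this reading, concept strengthening amounts to up-weighting the entries of the HRV for the target concept before it scales the heads, while concept adjusting rebalances the weights across concepts.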
Problem

Research questions and friction points this paper is trying to address.

Interpret cross-attention layers in diffusion models.
Align visual concepts with cross-attention head patterns.
Enhance image generation and editing using HRVs.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mechanistic interpretation of cross-attention layers
Construction of Head Relevance Vectors (HRVs)
Application of concept strengthening and concept adjusting methods
๐Ÿ”Ž Similar Papers
No similar papers found.
Jungwon Park
Department of Intelligence and Information, Seoul National University
Jungmin Ko
Interdisciplinary Program in Artificial Intelligence, Seoul National University
Dongnam Byun
Department of Intelligence and Information, Seoul National University
Jangwon Suh
Department of Intelligence and Information, Seoul National University
Wonjong Rhee
Seoul National University
Deep Learning Theory · Artificial Intelligence · Information Theory