🤖 AI Summary
This work addresses the limited interpretability and poor alignment with human visual concepts of cross-attention mechanisms in text-to-image diffusion models. We propose Head Relevance Vectors (HRVs): learnable, semantically aligned importance scores for attention heads. By systematically ablating attention heads, we identify critical ones and develop an HRV-guided strategy for concept enhancement and refinement, enabling the first fine-grained semantic control *across attention heads*. Our method significantly mitigates the misgeneration of ambiguous words (e.g., "apple" erroneously rendered as "orange"), successfully edits five long-standing challenging attributes (e.g., material, pose, illumination), and effectively suppresses catastrophic neglect in multi-concept generation. This work establishes a novel, interpretable, and intervention-friendly paradigm for controllable image generation.
📄 Abstract
Recent text-to-image diffusion models leverage cross-attention layers, which have been effectively utilized to enhance a range of visual generative tasks. However, our understanding of cross-attention layers remains limited. In this study, we introduce a mechanistic interpretability approach for diffusion models by constructing Head Relevance Vectors (HRVs) that align with human-specified visual concepts. An HRV for a given visual concept has a length equal to the total number of cross-attention heads, with each element representing the importance of the corresponding head for that concept. To validate HRVs as interpretable features, we develop an ordered weakening analysis that demonstrates their effectiveness. Furthermore, we propose concept strengthening and concept adjusting methods and apply them to enhance three visual generative tasks. Our results show that HRVs can reduce misinterpretations of polysemous words in image generation, successfully modify five challenging attributes in image editing, and mitigate catastrophic neglect in multi-concept generation. Overall, our work advances the understanding of cross-attention layers and introduces new approaches for fine-grained, head-level control of these layers.
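To make the HRV idea concrete, here is a minimal sketch of the core data structure and of a concept-strengthening step. This is an illustrative assumption, not the paper's implementation: the function `concept_strengthen`, its `scale` parameter, and the specific reweighting rule (amplifying each head's attention map in proportion to its normalized HRV entry) are all hypothetical names and choices made for exposition.

```python
import numpy as np

def concept_strengthen(attn_per_head, hrv, scale=1.0):
    """Hypothetical sketch of HRV-guided concept strengthening.

    attn_per_head: array of shape (H, ...) -- one cross-attention map per head.
    hrv: length-H Head Relevance Vector; element h is the importance of
         cross-attention head h for a given visual concept.
    scale: strengthening factor (scale=0 leaves the maps unchanged).
    """
    hrv = np.asarray(hrv, dtype=float)
    # Amplify concept-relevant heads; the most relevant head is boosted most.
    weights = 1.0 + scale * (hrv / hrv.max())
    # Broadcast one weight per head across that head's attention map.
    return attn_per_head * weights.reshape(-1, *([1] * (attn_per_head.ndim - 1)))
```

An ordered weakening analysis could be sketched analogously: suppress heads one by one in descending (or ascending) HRV order and observe how strongly the generated concept degrades.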