🤖 AI Summary
This work addresses the limited interpretability and poor alignment with human visual concepts of cross-attention mechanisms in text-to-image diffusion models. We propose Head Relevance Vectors (HRVs): learnable, semantically aligned importance scores for attention heads. By systematically ablating attention heads, we identify critical ones and develop an HRV-guided strategy for concept enhancement and refinement, enabling the first fine-grained semantic control *across attention heads*. Our method significantly mitigates the misgeneration of ambiguous words (e.g., "apple" erroneously rendered as "orange"), successfully edits five long-standing challenging attributes (e.g., material, pose, illumination), and effectively suppresses catastrophic neglect in multi-concept generation. This work establishes a novel, interpretable, and intervention-friendly paradigm for controllable image generation.
📄 Abstract
Recent text-to-image diffusion models leverage cross-attention layers, which have been effectively utilized to enhance a range of visual generative tasks. However, our understanding of cross-attention layers remains limited. In this study, we introduce a mechanistic interpretability approach for diffusion models by constructing Head Relevance Vectors (HRVs) that align with human-specified visual concepts. An HRV for a given visual concept has a length equal to the total number of cross-attention heads, with each element representing the importance of the corresponding head for that concept. To validate HRVs as interpretable features, we develop an ordered weakening analysis that demonstrates their effectiveness. Furthermore, we propose concept strengthening and concept adjusting methods and apply them to enhance three visual generative tasks. Our results show that HRVs can reduce misinterpretations of polysemous words in image generation, successfully modify five challenging attributes in image editing, and mitigate catastrophic neglect in multi-concept generation. Overall, our work advances the understanding of cross-attention layers and introduces new approaches for fine-grained, head-level control of these layers.
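To make the HRV idea concrete, here is a minimal sketch of the core data structure and of a concept-strengthening step. This is an illustrative assumption, not the paper's implementation: the function `concept_strengthen`, its `scale` parameter, and the specific reweighting rule (amplifying each head's attention map in proportion to its normalized HRV entry) are all hypothetical names and choices made for exposition.

```python
import numpy as np

def concept_strengthen(attn_per_head, hrv, scale=1.0):
    """Hypothetical sketch of HRV-guided concept strengthening.

    attn_per_head: array of shape (H, ...) -- one cross-attention map per head.
    hrv: length-H Head Relevance Vector; element h is the importance of
         cross-attention head h for a given visual concept.
    scale: strengthening factor (scale=0 leaves the maps unchanged).
    """
    hrv = np.asarray(hrv, dtype=float)
    # Amplify concept-relevant heads; the most relevant head is boosted most.
    weights = 1.0 + scale * (hrv / hrv.max())
    # Broadcast one weight per head across that head's attention map.
    return attn_per_head * weights.reshape(-1, *([1] * (attn_per_head.ndim - 1)))
```

An ordered weakening analysis could be sketched analogously: suppress heads one by one in descending (or ascending) HRV order and observe how strongly the generated concept degrades.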