🤖 AI Summary
Understanding how attention heads in multimodal Transformers specialize in processing semantic versus visual attributes remains challenging.
Method: We propose a signal-processing–inspired intermediate activation reconstruction technique to systematically probe, rank, and interpret the functional roles of individual attention heads.
Contribution/Results: We find that editing only ~1% of attention heads, identified as critical by our method, enables precise suppression or enhancement of specific concepts (e.g., question-answering outputs, toxic expressions, image categories, or descriptive attributes) across modalities. This points to a sparse, localizable, and cross-task-consistent controllable structure within the model. The approach is validated on diverse tasks, including open-domain QA, toxicity mitigation, image classification, and image captioning, demonstrating robust generalization, and it connects mechanistic interpretation of large multimodal models with practical, controllable editing.
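
As a rough illustration of the kind of head ranking described above, the sketch below scores each head by decoding its per-head contribution with the model's final decoding (unembedding) layer and keeping the top ~1%. This is a minimal sketch under assumed shapes and names (`head_outputs`, `W_O`, `W_U`, `concept_id`); it is not the paper's released code.

```python
import torch

def score_heads(head_outputs, W_O, W_U, concept_id):
    """Score each attention head by its contribution to a target concept's logit.

    head_outputs: (n_layers, n_heads, d_head) mean per-head attention outputs
                  over a set of prompts, taken before the output projection.
    W_O:          (n_layers, n_heads, d_head, d_model) per-head output projection.
    W_U:          (d_model, vocab_size) unembedding / final decoding matrix.
    concept_id:   vocabulary index of the target concept token.
    """
    # Each head's write into the residual stream: (n_layers, n_heads, d_model)
    resid = torch.einsum("lhd,lhdm->lhm", head_outputs, W_O)
    # Decode with the final layer ("logit lens"): (n_layers, n_heads, vocab_size)
    logits = resid @ W_U
    return logits[..., concept_id]  # concept logit contributed by each head

def top_heads(scores, frac=0.01):
    """Return the (layer, head) pairs with the largest absolute concept score."""
    k = max(1, int(frac * scores.numel()))
    idx = torch.topk(scores.abs().flatten(), k).indices
    return [(i // scores.shape[1], i % scores.shape[1]) for i in idx.tolist()]
```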
📝 Abstract
Language and vision-language models have shown impressive performance across a wide range of tasks, but their internal mechanisms remain only partly understood. In this work, we study how individual attention heads in text-generative models specialize in specific semantic or visual attributes. Building on an established interpretability method, we reinterpret the practice of probing intermediate activations with the final decoding layer through the lens of signal processing. This lets us analyze multiple samples in a principled way and rank attention heads based on their relevance to target concepts. Our results show consistent patterns of specialization at the head level across both unimodal and multimodal transformers. Remarkably, we find that editing as few as 1% of the heads, selected using our method, can reliably suppress or enhance targeted concepts in the model output. We validate our approach on language tasks such as question answering and toxicity mitigation, as well as vision-language tasks including image classification and captioning. Our findings highlight an interpretable and controllable structure within attention layers, offering simple tools for understanding and editing large-scale generative models.
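
For completeness, here is a hedged sketch of one way such head-level edits could be applied in practice: rescaling the slice of each layer's attention output projection that carries a selected head's contribution (alpha=0 to suppress, alpha>1 to enhance). The module path `model.model.layers[i].self_attn.o_proj` and the `head_dim` attribute follow common Hugging Face decoder implementations and are assumptions here; the paper's exact intervention may differ.

```python
import torch

@torch.no_grad()
def edit_heads(model, heads, alpha=0.0):
    """Rescale selected heads in place; alpha=0 suppresses, alpha>1 amplifies.

    heads: iterable of (layer_idx, head_idx) pairs, e.g. from top_heads() above.
    """
    for layer_idx, head_idx in heads:
        attn = model.model.layers[layer_idx].self_attn  # assumed module path
        d_head = attn.head_dim                          # assumed attribute name
        lo, hi = head_idx * d_head, (head_idx + 1) * d_head
        # nn.Linear stores weight as (out_features, in_features); input columns
        # lo:hi carry the concatenated output of this particular head.
        attn.o_proj.weight[:, lo:hi] *= alpha
```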