🤖 AI Summary
Understanding how attention heads in multimodal Transformers specialize in processing semantic versus visual attributes remains challenging.
Method: We propose a signal-processing–inspired intermediate activation reconstruction technique to systematically probe, rank, and interpret the functional roles of individual attention heads.
Contribution/Results: We find that editing only ~1% of attention heads, identified as critical by our method, enables precise suppression or enhancement of specific concepts (e.g., question-answering outputs, toxic expressions, image categories, or descriptive attributes) across modalities. This points to a sparse, localizable, and cross-task-consistent controllable structure within the model. The approach is validated on diverse tasks, including open-domain QA, toxicity mitigation, image classification, and image captioning, demonstrating robust generalization, and it connects mechanistic interpretation of large multimodal models with practical, controllable editing.
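
As a rough illustration of the kind of head ranking described above, the sketch below scores each head by decoding its per-head contribution with the model's final decoding (unembedding) layer and keeping the top ~1%. This is a minimal sketch under assumed shapes and names (`head_outputs`, `W_O`, `W_U`, `concept_id`); it is not the paper's released code.

```python
import torch

def score_heads(head_outputs, W_O, W_U, concept_id):
    """Score each attention head by its contribution to a target concept's logit.

    head_outputs: (n_layers, n_heads, d_head) mean per-head attention outputs
                  over a set of prompts, taken before the output projection.
    W_O:          (n_layers, n_heads, d_head, d_model) per-head output projection.
    W_U:          (d_model, vocab_size) unembedding / final decoding matrix.
    concept_id:   vocabulary index of the target concept token.
    """
    # Each head's write into the residual stream: (n_layers, n_heads, d_model)
    resid = torch.einsum("lhd,lhdm->lhm", head_outputs, W_O)
    # Decode with the final layer ("logit lens"): (n_layers, n_heads, vocab_size)
    logits = resid @ W_U
    return logits[..., concept_id]  # concept logit contributed by each head

def top_heads(scores, frac=0.01):
    """Return the (layer, head) pairs with the largest absolute concept score."""
    k = max(1, int(frac * scores.numel()))
    idx = torch.topk(scores.abs().flatten(), k).indices
    return [(i // scores.shape[1], i % scores.shape[1]) for i in idx.tolist()]
```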
📝 Abstract
Language and vision-language models have shown impressive performance across a wide range of tasks, but their internal mechanisms remain only partly understood. In this work, we study how individual attention heads in text-generative models specialize in specific semantic or visual attributes. Building on an established interpretability method, we reinterpret the practice of probing intermediate activations with the final decoding layer through the lens of signal processing. This lets us analyze multiple samples in a principled way and rank attention heads based on their relevance to target concepts. Our results show consistent patterns of specialization at the head level across both unimodal and multimodal transformers. Remarkably, we find that editing as few as 1% of the heads, selected using our method, can reliably suppress or enhance targeted concepts in the model output. We validate our approach on language tasks such as question answering and toxicity mitigation, as well as vision-language tasks including image classification and captioning. Our findings highlight an interpretable and controllable structure within attention layers, offering simple tools for understanding and editing large-scale generative models.
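
For completeness, here is a hedged sketch of one way such head-level edits could be applied in practice: rescaling the slice of each layer's attention output projection that carries a selected head's contribution (alpha=0 to suppress, alpha>1 to enhance). The module path `model.model.layers[i].self_attn.o_proj` and the `head_dim` attribute follow common Hugging Face decoder implementations and are assumptions here; the paper's exact intervention may differ.

```python
import torch

@torch.no_grad()
def edit_heads(model, heads, alpha=0.0):
    """Rescale selected heads in place; alpha=0 suppresses, alpha>1 amplifies.

    heads: iterable of (layer_idx, head_idx) pairs, e.g. from top_heads() above.
    """
    for layer_idx, head_idx in heads:
        attn = model.model.layers[layer_idx].self_attn  # assumed module path
        d_head = attn.head_dim                          # assumed attribute name
        lo, hi = head_idx * d_head, (head_idx + 1) * d_head
        # nn.Linear stores weight as (out_features, in_features); input columns
        # lo:hi carry the concatenated output of this particular head.
        attn.o_proj.weight[:, lo:hi] *= alpha
```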