What Do VLMs NOTICE? A Mechanistic Interpretability Pipeline for Noise-free Text-Image Corruption and Evaluation

📅 2024-06-24
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the opacity of vision-language models' (VLMs') decision-making processes, which hinders their deployment in high-stakes applications. To this end, the authors propose NOTICE—the first noise-free, joint vision-language causal attribution framework. NOTICE pairs Semantic Minimal Pairs (SMP) image perturbations with Symmetric Token Replacement (STR) for text, enabling coordinated, semantics-preserving interventions across both modalities. Combined with causal mediation analysis and cross-attention head attribution, it systematically identifies critical middle-layer attention heads in mainstream VLMs (e.g., BLIP). Experiments show that these heads exhibit both cross-task generalizability and functional specialization—such as implicit image segmentation and object inhibition—thereby improving decision interpretability and transparency on benchmarks including SVO-Probes, MIT-States, and facial expression recognition tasks.

📝 Abstract
Vision-Language Models (VLMs) have gained community-spanning prominence due to their ability to integrate visual and textual inputs to perform complex tasks. Despite their success, the internal decision-making processes of these models remain opaque, posing challenges in high-stakes applications. To address this, we introduce NOTICE, the first Noise-free Text-Image Corruption and Evaluation pipeline for mechanistic interpretability in VLMs. NOTICE incorporates a Semantic Minimal Pairs (SMP) framework for image corruption and Symmetric Token Replacement (STR) for text. This approach enables semantically meaningful causal mediation analysis for both modalities, providing a robust method for analyzing multimodal integration within models like BLIP. Our experiments on the SVO-Probes, MIT-States, and Facial Expression Recognition datasets reveal crucial insights into VLM decision-making, identifying the significant role of middle-layer cross-attention heads. Further, we uncover a set of "universal cross-attention heads" that consistently contribute across tasks and modalities, each performing distinct functions such as implicit image segmentation, object inhibition, and outlier inhibition. This work paves the way for more transparent and interpretable multimodal systems.
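The causal mediation analysis the abstract refers to is typically carried out via activation patching: run the model on clean and corrupted inputs, then restore one component's clean activation into the corrupted run and measure how much of the output gap it recovers. The toy below sketches that idea for per-head activations; the linear "model" and its readout weights are invented for illustration, not taken from the paper.

```python
import numpy as np

# Toy sketch of causal mediation analysis via activation patching.
# The "model" is a stand-in: a linear readout over per-head activations.
# All weights and activations here are random illustrations.
rng = np.random.default_rng(0)
n_heads, d = 4, 8
readout = rng.normal(size=(n_heads, d))  # maps each head's output to a logit

def run(head_acts):
    """Model score = sum over heads of readout[h] . activation[h]."""
    return float(sum(readout[h] @ head_acts[h] for h in range(n_heads)))

clean_acts = rng.normal(size=(n_heads, d))    # activations on clean input
corrupt_acts = rng.normal(size=(n_heads, d))  # activations on corrupted input

clean_score = run(clean_acts)
corrupt_score = run(corrupt_acts)

# Indirect effect of each head: patch its clean activation into the
# corrupted run and see what fraction of the clean-corrupt gap it restores.
for h in range(n_heads):
    patched = corrupt_acts.copy()
    patched[h] = clean_acts[h]
    recovery = (run(patched) - corrupt_score) / (clean_score - corrupt_score)
    print(f"head {h}: recovered {recovery:+.2f} of the clean-corrupt gap")
```

Because this toy model is linear in the head activations, the per-head recovery fractions sum to exactly 1; in a real VLM the effects interact, which is why identifying the handful of heads with large indirect effects is informative.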
Problem

Research questions and friction points this paper is trying to address.

Mechanistic interpretability in Vision-Language Models
Noise-free text-image corruption and evaluation
Universal cross-attention heads in VLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Noise-free Text-Image Corruption
Semantic Minimal Pairs framework
Symmetric Token Replacement
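The point of Symmetric Token Replacement is that the corrupted text stays grammatical and well-formed: each content token is swapped for a counterpart filling the same syntactic role, rather than being masked or replaced with noise. A minimal sketch, assuming a toy role lexicon (the word lists below are illustrative, not from the paper):

```python
# Illustrative sketch of Symmetric Token Replacement (STR): swap each
# role-tagged token for a different token of the same syntactic role,
# so the corrupted prompt stays grammatical and the same length.
# The lexicon is a toy assumption for demonstration only.
ROLE_LEXICON = {
    "subject": ["woman", "man", "dog"],
    "verb": ["eats", "holds", "throws"],
    "object": ["apple", "ball", "book"],
}

def symmetric_token_replacement(tokens, roles):
    """Replace each role-tagged token with another token of the same role."""
    corrupted = []
    for tok, role in zip(tokens, roles):
        if role in ROLE_LEXICON:
            candidates = [t for t in ROLE_LEXICON[role] if t != tok]
            corrupted.append(candidates[0])  # deterministic pick for clarity
        else:
            corrupted.append(tok)  # function words are left untouched
    return corrupted

clean = ["the", "woman", "eats", "an", "apple"]
roles = [None, "subject", "verb", None, "object"]
print(symmetric_token_replacement(clean, roles))
# → ['the', 'man', 'holds', 'an', 'ball']
```

The replacement is "symmetric" in that the corrupted sequence has the same length and structure as the clean one, so clean and corrupted runs are token-aligned, which is exactly what position-wise activation patching requires.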