A Survey on Mechanistic Interpretability for Multi-Modal Foundation Models

📅 2025-02-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal foundation models (MMFMs) lack mechanistic interpretability, and their fundamental differences from unimodal large language models (LLMs) remain poorly understood. Method: We propose the first structured taxonomy of MMFM interpretability methods, systematically integrating attribution analysis, feature disentanglement, neuron activation tracing, concept activation vectors (CAVs), and module-level interventions across contrastive, generative, and text-to-image architectures. Contribution/Results: Our analysis uncovers core mechanistic disparities in cross-modal interaction, particularly in representation alignment, information bottlenecks, and gradient propagation, that distinguish MMFMs from unimodal LLMs. We identify critical bottlenecks in current MMFM interpretability and introduce the first unified evaluation framework. This work provides both theoretical foundations and practical guidelines for designing, diagnosing, and optimizing trustworthy multimodal AI systems.
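
To make the CAV technique named above concrete, here is a minimal sketch under toy assumptions: the "activations" are random stand-ins for a chosen layer of a multimodal encoder, and the concept direction is taken as the normal of a linear probe's decision boundary (the general TCAV recipe, not this survey's specific setup).

```python
# Minimal CAV sketch. Assumption: the random arrays below stand in for
# activations extracted from a layer of a multimodal encoder (e.g., CLIP).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in activations: 100 examples containing the concept, 100 random ones.
concept_acts = rng.normal(loc=0.5, scale=1.0, size=(100, 512))
random_acts = rng.normal(loc=0.0, scale=1.0, size=(100, 512))

X = np.vstack([concept_acts, random_acts])
y = np.array([1] * 100 + [0] * 100)

# A linear probe separating concept from non-concept activations.
probe = LogisticRegression(max_iter=1000).fit(X, y)

# The CAV is the (normalized) normal vector of the probe's decision boundary.
cav = probe.coef_[0] / np.linalg.norm(probe.coef_[0])

# Conceptual sensitivity of a new activation: its projection onto the CAV.
new_act = rng.normal(size=512)
print("alignment with concept direction:", float(new_act @ cav))
```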

📝 Abstract
The rise of foundation models has transformed machine learning research, prompting efforts to uncover their inner workings and to develop more efficient and reliable applications for better control. While significant progress has been made in interpreting Large Language Models (LLMs), multimodal foundation models (MMFMs) - such as contrastive vision-language models, generative vision-language models, and text-to-image models - pose unique interpretability challenges beyond unimodal frameworks. Despite initial studies, a substantial gap remains between the interpretability of LLMs and that of MMFMs. This survey explores two key aspects: (1) the adaptation of LLM interpretability methods to multimodal models and (2) the mechanistic differences between unimodal language models and cross-modal systems. By systematically reviewing current MMFM analysis techniques, we propose a structured taxonomy of interpretability methods, compare insights across unimodal and multimodal architectures, and highlight critical research gaps.
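
As a minimal illustration of aspect (1), the sketch below adapts input-gradient attribution to a contrastive vision-language setup. The tiny linear "encoders" and random inputs are placeholders for a trained two-tower model such as CLIP; only the mechanics of attributing a similarity score back to pixels are shown.

```python
# Input-gradient attribution on a toy contrastive vision-language model.
# Assumption: the linear layers below stand in for trained image/text towers.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
image_encoder = torch.nn.Linear(3 * 32 * 32, 64)  # placeholder image tower
text_encoder = torch.nn.Linear(128, 64)           # placeholder text tower

image = torch.randn(1, 3 * 32 * 32, requires_grad=True)
text = torch.randn(1, 128)

# Contrastive score: cosine similarity between the two embeddings.
sim = F.cosine_similarity(image_encoder(image), text_encoder(text))[0]

# Attribute the similarity score back to image pixels via gradients.
sim.backward()
saliency = image.grad.abs().reshape(3, 32, 32).sum(dim=0)

row, col = divmod(int(saliency.argmax()), 32)
print(f"similarity={float(sim):.3f}, most influential pixel=({row}, {col})")
```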
Problem

Research questions and friction points this paper is trying to address.

Adapting LLM interpretability methods to multimodal models
Understanding mechanistic differences between unimodal and cross-modal systems (a module-patching sketch follows this list)
Highlighting research gaps in multimodal foundation model interpretability
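
One way such mechanistic differences are probed is through module-level interventions (activation patching): overwrite a module's activation in one run with the activation cached from another, and observe how the output changes. A minimal sketch, assuming a toy feed-forward network in place of an actual MMFM:

```python
# Activation patching on a toy network: swap one layer's activation from a
# "clean" run into a "corrupted" run. Network and inputs are assumptions.
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(
    torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 8)
)
clean_x, corrupt_x = torch.randn(1, 16), torch.randn(1, 16)
cache = {}

def cache_hook(module, inputs, output):
    cache["act"] = output.detach()  # returning None leaves the output as-is

def patch_hook(module, inputs, output):
    return cache["act"]  # returning a tensor replaces the module's output

# 1) Cache the clean activation at the target module.
h = model[0].register_forward_hook(cache_hook)
clean_out = model(clean_x)
h.remove()

# 2) Re-run on the corrupted input, patching in the cached clean activation.
h = model[0].register_forward_hook(patch_hook)
patched_out = model(corrupt_x)
h.remove()

corrupt_out = model(corrupt_x)
print("patch moved output toward clean run:",
      bool(torch.dist(patched_out, clean_out)
           < torch.dist(corrupt_out, clean_out)))
```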
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptation of LLM interpretability methods to MMFMs (see the activation-tracing sketch after this list)
Comparison of mechanistic insights across unimodal and multimodal architectures
A structured taxonomy of MMFM interpretability methods
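
Neuron activation tracing, another method in the taxonomy, is commonly implemented by recording intermediate activations with forward hooks. A minimal PyTorch sketch, assuming a toy two-layer network in place of a real MMFM:

```python
# Activation tracing with forward hooks. Assumption: a toy two-layer network
# stands in for an MMFM block; hook targets and inputs are illustrative.
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(
    torch.nn.Linear(16, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 8),
)
traced = {}

def make_hook(name):
    def hook(module, inputs, output):
        traced[name] = output.detach()  # record this module's activation
    return hook

# Register a hook on every submodule we want to trace.
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        module.register_forward_hook(make_hook(name))

_ = model(torch.randn(1, 16))

for name, act in traced.items():
    print(f"{name}: shape={tuple(act.shape)}, mean={act.mean():.3f}")
```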
👥 Authors

Zihao Lin
University of California, Davis, Department of Computer Science
Multi-modal Representation Learning · Natural Language Processing

Samyadeep Basu
Research Scientist at Adobe Research | Prev: UMD, MSR
Machine Learning · Influence Functions · Interpretability · Few-shot learning

Mohammad Beigi
UC Davis

Varun Manjunatha
Senior Research Scientist, Adobe Research
CV · NLP · LLMs

Ryan A. Rossi
Adobe Research
Machine Learning · Personalization · Graph Representation Learning · Graph ML · Graph Theory

Zichao Wang
Adobe Research
document AI · AI for education · natural language processing · machine learning

Yufan Zhou
Adobe

S. Balasubramanian
Associate Professor, SSSIHL
Computer Vision · Machine Learning

Arman Zarei
University of Maryland, College Park
Image Generation · Generative Models · Computer Vision · Machine Learning

Keivan Rezaei
Ph.D. Student, University of Maryland
Knowledge Localization · Unlearning · Model Editing · Interpretability

Ying Shen
UIUC

Barry Menglong Yao
PhD Student, Virginia Tech
LLM for machine teaching · entity linking · fake news detection

Zhiyang Xu
Virginia Tech

Qin Liu
UC Davis

Yuxiang Zhang
Waseda University

Y. Sun
University of Sydney

Shilong Liu
RS@ByteDance, PhD@THU
Computer Vision · Object Detection · Visual Grounding · Multi-Modality · Multimodal Agent

Li Shen
Sun Yat-Sen University

Hongxuan Li
Duke University

S. Feizi
University of Maryland

Lifu Huang
Assistant Professor, UC Davis
Natural Language Processing · Multimodal Learning · AI for Science · Multilingual