Understanding Multimodal LLMs: the Mechanistic Interpretability of Llava in Visual Question Answering

📅 2024-11-17
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the reasoning mechanisms and interpretability of the LLaVA multimodal large language model in visual question answering (VQA). Methodologically, it pioneers the systematic application of mechanistic interpretability techniques to multimodal models, integrating attention attribution, cross-modal feature alignment, instruction-tuning mechanism disentanglement, and visual embedding projection mapping to uncover vision–language coordination. Key contributions include: (1) identifying an in-context-learning-like VQA reasoning pattern in LLaVA; (2) proposing a novel, localizable attention attribution tool that significantly mitigates visual hallucination; (3) demonstrating superior attribution speed and accuracy over state-of-the-art baselines; (4) empirically validating that visual instruction tuning enhances and preserves Vicuna's textual capabilities; and (5) open-sourcing an interpretability toolkit supporting interactive, visualization-enabled analysis.
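The attention attribution idea in the summary can be illustrated with a minimal sketch: score each visual token position by the product of attention weight and the gradient of the answer logit with respect to that weight, then rank positions. This is a generic attention-attribution recipe, not the paper's exact tool; the function name, array shapes, and toy data below are assumptions for illustration.

```python
import numpy as np

def visual_token_importance(attn, grad_attn, visual_slice):
    """Score visual tokens by attention * gradient (attention attribution).

    attn, grad_attn: arrays of shape (n_heads, n_keys) holding the attention
    weights at the answer position and the gradient of the answer logit
    with respect to those weights.
    visual_slice: slice covering the visual-token positions in the sequence.
    """
    # Element-wise product, summed over heads, restricted to visual tokens.
    return (attn * grad_attn).sum(axis=0)[visual_slice]

# Toy example: 2 heads, 6 key positions, visual tokens at positions 1..4.
rng = np.random.default_rng(0)
attn = rng.random((2, 6))
grad_attn = rng.random((2, 6))
scores = visual_token_importance(attn, grad_attn, slice(1, 5))
top_position = int(np.argmax(scores)) + 1  # map back to sequence position
```

High-scoring positions point at the image patches that most influenced the final prediction, which is the kind of localization the summary credits with mitigating visual hallucination.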

📝 Abstract
Understanding the mechanisms behind Large Language Models (LLMs) is crucial for designing improved models and strategies. While recent studies have yielded valuable insights into the mechanisms of textual LLMs, the mechanisms of Multi-modal Large Language Models (MLLMs) remain underexplored. In this paper, we apply mechanistic interpretability methods to analyze the visual question answering (VQA) mechanisms in the first MLLM, Llava. We compare the mechanisms between VQA and textual QA (TQA) in color answering tasks and find that: a) VQA exhibits a mechanism similar to the in-context learning mechanism observed in TQA; b) the visual features exhibit significant interpretability when projecting the visual embeddings into the embedding space; and c) Llava enhances the existing capabilities of the corresponding textual LLM Vicuna during visual instruction tuning. Based on these findings, we develop an interpretability tool to help users and researchers identify important visual locations for final predictions, aiding in the understanding of visual hallucination. Our method demonstrates faster and more effective results compared to existing interpretability approaches. Code: https://github.com/zepingyu0512/llava-mechanism
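Finding (b) — that visual embeddings become interpretable when projected into the token embedding space — can be sketched in a logit-lens style: multiply a visual embedding by the unembedding matrix and read off the nearest vocabulary tokens. The vocabulary, matrix values, and function name below are toy assumptions, not values from the paper.

```python
import numpy as np

def project_to_vocab(visual_embed, unembed, vocab, k=3):
    """Project a visual embedding through the unembedding matrix and
    return the k highest-scoring vocabulary tokens."""
    logits = unembed @ visual_embed          # shape: (vocab_size,)
    top_ids = np.argsort(logits)[::-1][:k]   # highest-logit tokens first
    return [vocab[i] for i in top_ids]

# Toy setup: a 4-word vocabulary with 3-dimensional embeddings.
vocab = ["red", "blue", "cat", "dog"]
unembed = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0],
                    [0.5, 0.5, 0.0]])
patch = np.array([0.9, 0.1, 0.0])  # hypothetical "red-ish" image patch
tokens = project_to_vocab(patch, unembed, vocab, k=2)
# logits: red=0.9, dog=0.5, blue=0.1, cat=0.0 → top-2 is ["red", "dog"]
```

If a patch embedding from a red object lands nearest color-word tokens, that is the kind of interpretability the abstract reports for Llava's projected visual features.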
Problem

Research questions and friction points this paper is trying to address.

Multimodal Language Models
Visual Question Answering
Transparency and Explainability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Large Language Models
Visual Question Answering
Llava Model Enhancement