Investigating The Functional Roles of Attention Heads in Vision Language Models: Evidence for Reasoning Modules

📅 2025-12-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Visual language models (VLMs) remain largely opaque, with limited understanding of how individual attention heads contribute to multimodal reasoning. Method: We introduce CogVision—a multi-granularity, chain-of-reasoning dataset—and integrate probe-based analysis, causal intervention, and cross-architecture functional localization to systematically dissect attention head functionality in VLMs. Contribution/Results: We discover, for the first time, a sparse, functionally specialized, and hierarchically interacting modular structure among attention heads—where distinct heads consistently specialize in visual perception, cross-modal alignment, and logical reasoning, mirroring human-like cognitive division of labor. This pattern is robustly replicated across multiple mainstream VLM families (e.g., LLaVA, Qwen-VL, InternVL). Causal ablation demonstrates that removing such specialized heads degrades multimodal reasoning performance by 12.3% on average, while targeted enhancement improves it by 5.7%, confirming their causal necessity and functional centrality in multimodal inference.

Technology Category

Application Category

📝 Abstract
Despite excelling on multimodal benchmarks, vision-language models (VLMs) largely remain a black box. In this paper, we propose a novel interpretability framework to systematically analyze the internal mechanisms of VLMs, focusing on the functional roles of attention heads in multimodal reasoning. To this end, we introduce CogVision, a dataset that decomposes complex multimodal questions into step-by-step subquestions designed to simulate human reasoning through a chain-of-thought paradigm, with each subquestion associated with specific receptive or cognitive functions such as high-level visual reception and inference. Using a probing-based methodology, we identify attention heads that specialize in these functions and characterize them as functional heads. Our analysis across diverse VLM families reveals that these functional heads are universally sparse, vary in number and distribution across functions, and mediate interactions and hierarchical organization. Furthermore, intervention experiments demonstrate their critical role in multimodal reasoning: removing functional heads leads to performance degradation, while emphasizing them enhances accuracy. These findings provide new insights into the cognitive organization of VLMs and suggest promising directions for designing models with more human-aligned perceptual and reasoning abilities.
Problem

Research questions and friction points this paper is trying to address.

Analyzes attention heads' functional roles in vision-language models
Investigates multimodal reasoning mechanisms using a novel interpretability framework
Explores cognitive organization to enhance human-aligned perceptual abilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Probing-based method identifies specialized attention heads
CogVision dataset simulates human reasoning step-by-step
Intervention experiments show functional heads' critical role
🔎 Similar Papers
No similar papers found.