How Vision Becomes Language: A Layer-wise Information-Theoretic Analysis of Multimodal Reasoning

📅 2026-02-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates how multimodal Transformers in visual question answering depend, layer by layer, on visual, linguistic, and cross-modal information. The authors propose PID Flow, a layer-wise information decomposition framework based on Partial Information Decomposition (PID) that combines dimensionality reduction, normalizing-flow Gaussianization, and closed-form Gaussian PID estimation to make the decomposition tractable for high-dimensional representations. Experiments on LLaVA-1.5-7B and LLaVA-1.6-7B reveal a consistent "modal transduction" pattern: visual information decays rapidly in early layers, linguistic signals dominate final-layer predictions, and cross-modal synergy remains limited. Targeted attention-ablation studies establish the pattern's causal basis, quantify information-loss bottlenecks, and confirm its robustness across tasks.
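
The summary names three stages but the card ships no code, so below is a minimal, self-contained Python sketch of the pipeline shape. Everything specific in it is an assumption: PCA stands in for whatever reduction the authors use, a marginal rank-based Gaussianization stands in for the learned normalizing flow, and redundancy uses the minimum-mutual-information (MMI) definition that makes Gaussian PID closed-form (Barrett, 2015), which may differ from the paper's exact estimator. The target identity is I(Z; Xv, Xl) = Redundant + Vision-unique + Language-unique + Synergy.

```python
import numpy as np
from scipy.stats import norm, rankdata
from sklearn.decomposition import PCA

def gaussianize(x):
    """Marginal rank-based Gaussianization: a cheap stand-in for the
    paper's normalizing-flow step (assumption, not the authors' code)."""
    u = rankdata(x, axis=0) / (x.shape[0] + 1)   # empirical CDF in (0, 1)
    return norm.ppf(u)

def gaussian_mi(cov, a, b):
    """I(A;B) in nats for jointly Gaussian blocks indexed by lists a, b."""
    ld = lambda idx: np.linalg.slogdet(cov[np.ix_(idx, idx)])[1]
    return 0.5 * (ld(a) + ld(b) - ld(a + b))

def pid_flow_layer(z, xv, xl, dim=16):
    """Decompose I(Z; Xv, Xl) for one layer's activations.
    z: (n, dz) prediction-side features; xv, xl: (n, d) vision/language."""
    prep = lambda m: gaussianize(PCA(n_components=dim).fit_transform(m))
    z, xv, xl = prep(z), prep(xv), prep(xl)
    cov = np.cov(np.hstack([z, xv, xl]), rowvar=False)
    iz, iv, il = ([*range(dim)], [*range(dim, 2 * dim)],
                  [*range(2 * dim, 3 * dim)])
    i_v, i_l = gaussian_mi(cov, iz, iv), gaussian_mi(cov, iz, il)
    i_vl = gaussian_mi(cov, iz, iv + il)
    r = min(i_v, i_l)                            # MMI redundancy
    return {"redundant": r,
            "vision_unique": i_v - r,
            "language_unique": i_l - r,
            "synergy": i_vl - i_v - i_l + r}
```

By construction the four components sum to the joint mutual information, so layer-wise trajectories such as the reported early vision-unique peak and late language-unique surge can be read directly off these dictionaries, one call per layer.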

📝 Abstract
When a multimodal Transformer answers a visual question, is the prediction driven by visual evidence, linguistic reasoning, or genuinely fused cross-modal computation -- and how does this structure evolve across layers? We address this question with a layer-wise framework based on Partial Information Decomposition (PID) that decomposes the predictive information at each Transformer layer into redundant, vision-unique, language-unique, and synergistic components. To make PID tractable for high-dimensional neural representations, we introduce \emph{PID Flow}, a pipeline combining dimensionality reduction, normalizing-flow Gaussianization, and closed-form Gaussian PID estimation. Applying this framework to LLaVA-1.5-7B and LLaVA-1.6-7B across six GQA reasoning tasks, we uncover a consistent \emph{modal transduction} pattern: visual-unique information peaks early and decays with depth, language-unique information surges in late layers to account for roughly 82\% of the final prediction, and cross-modal synergy remains below 2\%. This trajectory is highly stable across model variants (layer-wise correlations $>$0.96) yet strongly task-dependent, with semantic redundancy governing the detailed information fingerprint. To establish causality, we perform targeted Image$\rightarrow$Question attention knockouts and show that disrupting the primary transduction pathway induces predictable increases in trapped visual-unique information, compensatory synergy, and total information cost -- effects that are strongest in vision-dependent tasks and weakest in high-redundancy tasks. Together, these results provide an information-theoretic, causal account of how vision becomes language in multimodal Transformers, and offer quantitative guidance for identifying architectural bottlenecks where modality-specific information is lost.
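
The abstract's causal claim rests on Image$\rightarrow$Question attention knockouts. As a rough, model-agnostic illustration of the masking logic (not LLaVA's actual internals or the authors' hook API), the sketch below sets pre-softmax attention logits from question-position queries to image-position keys to negative infinity, severing the transduction pathway; the tensor layout and position-index conventions are assumptions.

```python
import torch

def knockout_image_to_question(scores: torch.Tensor,
                               img_pos: torch.Tensor,
                               q_pos: torch.Tensor) -> torch.Tensor:
    """Cut the Image->Question pathway in one attention layer.

    scores: (batch, heads, q_len, k_len) pre-softmax attention logits.
    img_pos / q_pos: 1-D LongTensors of image-token and question-token
    positions (layout is an assumption; real LLaVA internals differ).
    """
    scores = scores.clone()
    # Question-position queries may no longer attend to image-token keys.
    scores[:, :, q_pos[:, None], img_pos[None, :]] = float("-inf")
    return scores

# Hypothetical use inside a forward hook on an attention module:
#   logits = knockout_image_to_question(logits, img_pos, q_pos)
#   attn = logits.softmax(dim=-1)
```

On the paper's account, such a knockout should leave visual-unique information trapped upstream of the ablated layer and induce compensatory synergy, which is exactly what the ablation experiments quantify.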
Problem

Research questions and friction points this paper is trying to address.

multimodal reasoning
vision-language models
information decomposition
cross-modal synergy
layer-wise analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Partial Information Decomposition
Multimodal Transformers
Modal Transduction
PID Flow
Information-Theoretic Analysis
Authors

Hongxuan Wu
Duke Kunshan University, Duke University
Yukun Zhang
Harbin Institute of Technology (Shenzhen)
Xueqing Zhou
Fudan University, Shanghai, China