How Do Vision-Language Models Process Conflicting Information Across Modalities?

📅 2025-07-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates cross-modal decision-making mechanisms in vision-language models (VLMs) under modality conflicts—e.g., an image of a dog paired with the caption “This is a cat.” We systematically construct conflict-rich multimodal samples and employ attention head localization, representation space analysis, and instruction-guided modality selection tasks. Our analysis reveals an intrinsic modality bias in VLMs and identifies two functionally distinct architectural components: (i) dedicated attention heads that regulate modality preference, and (ii) transferable, modality-agnostic “router heads” that dynamically route information across modalities. Crucially, targeted intervention on router heads significantly improves model accuracy in detecting multimodal consistency. This work provides the first empirical evidence of a hierarchical modality fusion architecture within VLMs, uncovering interpretable, controllable mechanisms for multimodal reasoning and cross-modal alignment.
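The "targeted intervention on router heads" described above amounts to scaling or ablating the output of individual attention heads before their contributions are combined. The sketch below is a toy illustration of that idea in plain NumPy, not the paper's actual code: `multi_head_attention` and the `head_scale` parameter are hypothetical names, and real VLM interventions would hook into a pretrained model's attention layers instead.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, head_scale=None):
    """Toy multi-head self-attention (no output projection).

    head_scale: optional per-head multipliers. Setting an entry to 0
    ablates that head; values > 1 amplify it. This mimics the kind of
    targeted head-level intervention the summary describes, on a toy
    model rather than a real VLM.
    """
    n_heads = Wq.shape[0]
    if head_scale is None:
        head_scale = np.ones(n_heads)
    outs = []
    for h in range(n_heads):
        # Per-head projections: x is (seq_len, d_model), Wq[h] is (d_model, d_head)
        q, k, v = x @ Wq[h], x @ Wk[h], x @ Wv[h]
        attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))
        # Scale this head's contribution before concatenation
        outs.append(head_scale[h] * (attn @ v))
    return np.concatenate(outs, axis=-1)
```

Zeroing one entry of `head_scale` removes exactly that head's slice of the output while leaving the other heads untouched, which is why head-level interventions are such a clean causal probe.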

📝 Abstract
AI models are increasingly required to be multimodal, integrating disparate input streams into a coherent state representation on which subsequent behaviors and actions can be based. This paper seeks to understand how such models behave when input streams present conflicting information. Focusing specifically on vision-language models, we provide inconsistent inputs (e.g., an image of a dog paired with the caption "A photo of a cat") and ask the model to report the information present in one of the specific modalities (e.g., "What does the caption say / What is in the image?"). We find that models often favor one modality over the other, e.g., reporting the image regardless of what the caption says, but that different models differ in which modality they favor. We find evidence that the behaviorally preferred modality is evident in the internal representational structure of the model, and that specific attention heads can restructure the representations to favor one modality over the other. Moreover, we find modality-agnostic "router heads" which appear to promote answers about the modality requested in the instruction, and which can be manipulated or transferred in order to improve performance across datasets and modalities. Together, the work provides essential steps towards identifying and controlling if and how models detect and resolve conflicting signals within complex multimodal environments.
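The probing setup in the abstract — an image of one class paired with a caption naming another, plus a question targeting a single modality — can be sketched as a small data-construction step. The function and field names below are hypothetical illustrations of that setup, not the authors' dataset code.

```python
import random

def make_conflict_pairs(labeled_images, labels, seed=0):
    """Build conflicting image-caption probes.

    labeled_images: list of (image_id, true_label) tuples.
    labels: the full label set to draw mismatched captions from.
    Each sample pairs an image with a caption naming a *different*
    class, plus one probe question per modality, as in the abstract's
    dog-image / "A photo of a cat" example.
    """
    rng = random.Random(seed)
    pairs = []
    for image_id, true_label in labeled_images:
        wrong = rng.choice([l for l in labels if l != true_label])
        pairs.append({
            "image_id": image_id,
            "image_label": true_label,
            "caption": f"A photo of a {wrong}.",
            "caption_label": wrong,
            "probe_image": "What is in the image?",
            "probe_caption": "What does the caption say?",
        })
    return pairs
```

Comparing the model's answer to `probe_image` versus `probe_caption` on such pairs is what reveals which modality it favors under conflict.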
Problem

Research questions and friction points this paper is trying to address.

Understand how vision-language models process conflicting multimodal inputs
Identify which modality models favor during information conflicts
Explore internal mechanisms controlling modality preference in models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzing modality bias in vision-language models
Identifying modality-specific attention heads
Manipulating router heads for improved performance