Mixed Signals: Decoding VLMs' Reasoning and Underlying Bias in Vision-Language Conflict

📅 2025-04-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing studies lack a systematic evaluation of vision-language models (VLMs) under image-text mismatch, which hinders understanding of their modality preferences and reasoning mechanisms. Method: building on existing benchmarks, the authors construct five datasets of mismatched image-text pairs spanning mathematics, science, and visual description, and compare three mitigation strategies, including a task-decomposition approach that reasons over each modality separately before integrating the results. Contribution/Results: they quantify a reversal of modality preference as query complexity increases, with the gap between image- and text-preferred responses ranging from +56.8% (image favored) to -74.4% (text favored) depending on task and model; on high-difficulty tasks, the decomposition strategy improves reasoning consistency and modality fairness, reducing average bias by 32.7%. The work provides empirical grounding for analyzing and regulating multimodal alignment in VLMs.

📝 Abstract
Vision-language models (VLMs) have demonstrated impressive performance by effectively integrating visual and textual information to solve complex tasks. However, it is not clear how these models reason over the visual and textual data together, nor how the flow of information between modalities is structured. In this paper, we examine how VLMs reason by analyzing their biases when confronted with scenarios that present conflicting image and text cues, a common occurrence in real-world applications. To uncover the extent and nature of these biases, we build upon existing benchmarks to create five datasets containing mismatched image-text pairs, covering topics in mathematics, science, and visual descriptions. Our analysis shows that VLMs favor text in simpler queries but shift toward images as query complexity increases. This bias correlates with model scale, with the difference between the percentage of image- and text-preferred responses ranging from +56.8% (image favored) to -74.4% (text favored), depending on the task and model. In addition, we explore three mitigation strategies: simple prompt modifications, modifications that explicitly instruct models on how to handle conflicting information (akin to chain-of-thought prompting), and a task decomposition strategy that analyzes each modality separately before combining their results. Our findings indicate that the effectiveness of these strategies in identifying and mitigating bias varies significantly and is closely linked to the model's overall performance on the task and the specific modality in question.
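The third mitigation strategy in the abstract, task decomposition, queries each modality in isolation and then reconciles the two answers. A minimal sketch of that flow is below; `query_model`, the reconciliation step, and the prompt wording are illustrative assumptions, not the paper's actual implementation.

```python
def query_model(prompt: str, image=None) -> str:
    # Placeholder: a real implementation would call a VLM endpoint here.
    raise NotImplementedError

def decomposed_answer(question: str, image, text: str, query=query_model) -> str:
    """Answer `question` by querying each modality separately, then combining."""
    # Step 1: reason over the image alone.
    image_ans = query(f"Using only the image, answer: {question}", image=image)
    # Step 2: reason over the text alone.
    text_ans = query(f"Using only this text: {text!r}\nAnswer: {question}")
    # Step 3: if the modalities agree, return the shared answer.
    if image_ans == text_ans:
        return image_ans
    # Otherwise surface the conflict explicitly and ask for a reconciled answer.
    return query(
        f"The image suggests {image_ans!r} but the text suggests {text_ans!r}. "
        f"They conflict. Resolve the conflict and answer: {question}",
        image=image,
    )
```

The explicit "they conflict" prompt in step 3 mirrors the paper's finding that end-to-end fusion hides which modality a model silently trusted.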
Problem

Research questions and friction points this paper is trying to address.

Analyzing VLMs' biases in conflicting image-text scenarios
Investigating modality preference shifts with query complexity
Exploring mitigation strategies for bias in VLMs
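The headline numbers (+56.8% to -74.4%) are differences between the percentage of image-preferred and text-preferred responses. One plausible reading of that metric, as a sketch with assumed response labels:

```python
def preference_gap(responses: list[str]) -> float:
    """Percentage of image-preferred responses minus percentage of
    text-preferred ones. Positive => image favored, negative => text favored.
    Labels other than "image"/"text" (e.g. refusals) count toward neither."""
    n = len(responses)
    image_pct = 100.0 * sum(r == "image" for r in responses) / n
    text_pct = 100.0 * sum(r == "text" for r in responses) / n
    return image_pct - text_pct
```

For example, a task where half the responses follow the image and a quarter follow the text yields a gap of +25.0.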
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzing biases in VLMs with conflicting image-text cues
Creating datasets with mismatched image-text pairs
Exploring prompt modifications and task decomposition strategies