Do Images Speak Louder than Words? Investigating the Effect of Textual Misinformation in VLMs

📅 2026-01-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the pronounced vulnerability of current vision-language models (VLMs) to misleading textual inputs that contradict visual content, often causing them to disregard visual evidence in favor of erroneous text. To systematically investigate VLM robustness under such image-text conflicts, we introduce CONTEXT-VQA—the first multimodal benchmark dataset specifically designed to evaluate susceptibility to textual misinformation—and propose a comprehensive evaluation framework incorporating adversarial text generation, multi-turn dialogue simulation, and visual question answering (VQA) assessment. Through systematic stress testing of 11 state-of-the-art VLMs, we observe an average performance drop exceeding 48.2%, revealing a critical overreliance on textual cues during multimodal reasoning and highlighting a fundamental limitation in existing approaches.

📝 Abstract
Vision-Language Models (VLMs) have shown strong multimodal reasoning capabilities on Visual Question Answering (VQA) benchmarks. However, their robustness against textual misinformation remains underexplored. While existing research has studied the effect of misinformation in text-only domains, it is not clear how VLMs arbitrate between contradictory information from different modalities. To bridge this gap, we first propose the CONTEXT-VQA (i.e., Conflicting Text) dataset, consisting of image-question pairs together with systematically generated persuasive prompts that deliberately conflict with the visual evidence. We then design and execute a thorough evaluation framework to benchmark the susceptibility of various models to these conflicting multimodal inputs. Comprehensive experiments over 11 state-of-the-art VLMs reveal that these models are indeed vulnerable to misleading textual prompts, often overriding clear visual evidence in favor of the conflicting text, and showing an average performance drop of over 48.2% after only one round of persuasive conversation. Our findings highlight a critical limitation in current VLMs and underscore the need for improved robustness against textual manipulation.
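The evaluation protocol described in the abstract can be sketched as a simple two-round measurement: query the model on each image-question pair once normally, then again after injecting a persuasive prompt that contradicts the image, and compare accuracies. The sketch below is a hypothetical illustration only; the model interface, field names, and toy data are assumptions, not the authors' actual CONTEXT-VQA code or dataset.

```python
def evaluate(model, dataset):
    """Return (clean_accuracy, misled_accuracy, relative_drop).

    `model(image, prompt)` is a stand-in for any VLM call; `dataset`
    items carry an image, a question, the ground-truth answer, and a
    persuasive prompt that conflicts with the visual evidence.
    """
    clean = misled = 0
    for item in dataset:
        # Round 1: plain visual question answering.
        if model(item["image"], item["question"]) == item["answer"]:
            clean += 1
        # Round 2: the same question, preceded by conflicting text.
        misleading = item["conflicting_text"] + " " + item["question"]
        if model(item["image"], misleading) == item["answer"]:
            misled += 1
    n = len(dataset)
    drop = (clean - misled) / clean if clean else 0.0
    return clean / n, misled / n, drop

# Toy model that trusts text over pixels, mimicking the failure mode
# the paper reports (overriding visual evidence in favor of text).
def gullible_vlm(image, prompt):
    if "Actually," in prompt:   # persuaded by the conflicting text
        return "wrong"
    return image["label"]       # otherwise answers from the image

toy_data = [
    {"image": {"label": "cat"}, "question": "What animal is this?",
     "answer": "cat",
     "conflicting_text": "Actually, experts confirmed this is a dog."},
    {"image": {"label": "bus"}, "question": "What vehicle is this?",
     "answer": "bus",
     "conflicting_text": "Actually, this photo shows a tram."},
]

clean_acc, misled_acc, rel_drop = evaluate(gullible_vlm, toy_data)
print(clean_acc, misled_acc, rel_drop)  # 1.0 0.0 1.0
```

The relative drop computed here corresponds to the paper's headline metric: a fully persuadable model loses 100% of its clean accuracy after one round of conflicting text.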
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
textual misinformation
multimodal reasoning
Visual Question Answering
model robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Models
Textual Misinformation
Multimodal Robustness
CONTEXT-VQA
Visual Question Answering