🤖 AI Summary
This work systematically investigates the robustness of vision-language models (VLMs) under multimodal knowledge conflicts, focusing on three conflict types: parametric, source, and counterfactual conflicts. To this end, the authors propose segsub, a framework that generates controllable conflict samples via targeted perturbations to image sources. Experiments reveal that, contrary to prior findings on LLMs and textual perturbations, VLMs are largely robust to image perturbations; however, they perform poorly on counterfactual examples (<30% accuracy) and fail almost entirely to reason over source conflicts (<1% accuracy). Image context is also linked to hallucination: GPT-4o is prone to hallucinate when presented with highly contextualized counterfactual examples. While source conflicts remain challenging, fine-tuning substantially improves reasoning over counterfactual samples. These findings highlight the need for VLM training methodologies that strengthen reasoning over complex knowledge conflicts between multimodal sources.
📝 Abstract
The robustness of large language models (LLMs) against knowledge conflicts in unimodal question answering systems has been well studied. However, the effect of conflicts in information sources on vision-language models (VLMs) in multimodal settings has not yet been explored. In this work, we propose segsub, a framework that applies targeted perturbations to image sources to study and improve the robustness of VLMs against three different types of knowledge conflicts, namely parametric, source, and counterfactual conflicts. Contrary to prior findings showing that LLMs are sensitive to parametric conflicts arising from textual perturbations, we find that VLMs are largely robust to image perturbations. On the other hand, VLMs perform poorly on counterfactual examples (<30% accuracy) and fail to reason over source conflicts (<1% accuracy). We also find a link between hallucinations and image context, with GPT-4o prone to hallucination when presented with highly contextualized counterfactual examples. While challenges persist with source conflicts, finetuning models significantly improves reasoning over counterfactual samples. Our findings highlight the need for VLM training methodologies that enhance their reasoning capabilities, particularly in addressing complex knowledge conflicts between multimodal sources.