Insight Over Sight: Exploring the Vision-Knowledge Conflicts in Multimodal LLMs

📅 2024-10-10
📈 Citations: 2
Influential: 0
🤖 AI Summary
This work addresses the prevalent commonsense-level vision–knowledge conflict in multimodal large language models (MLLMs)—a scenario where image content contradicts the model’s parametric commonsense knowledge. We introduce VKBench, the first fine-grained diagnostic benchmark comprising 374 images and 1,122 high-quality QA pairs, systematically evaluating nine state-of-the-art MLLMs. To construct conflict-rich data, we propose a human-in-the-loop automated generation framework and a “Focus-on-Vision” prompting strategy. Experiments reveal that models over-rely on parametric knowledge in ~20% of queries—particularly in yes/no classification and action reasoning tasks—leading to erroneous predictions. While our methods partially mitigate such conflicts, fundamental resolution remains challenging. This is the first work to formally model, diagnose, and intervene in commonsense-level vision–knowledge conflicts, establishing a new benchmark and methodology for enhancing the reliability and trustworthiness of MLLM reasoning.

📝 Abstract
This paper explores the problem of commonsense-level vision-knowledge conflict in Multimodal Large Language Models (MLLMs), where visual information contradicts the model's internal commonsense knowledge. To study this issue, we introduce an automated framework, augmented with human-in-the-loop quality control, to generate inputs designed to simulate and evaluate these conflicts in MLLMs. Using this framework, we have crafted a diagnostic benchmark consisting of 374 original images and 1,122 high-quality question-answer (QA) pairs. The benchmark covers two aspects of conflict and three question types, providing a thorough assessment tool. We apply this benchmark to assess the conflict-resolution capabilities of nine representative MLLMs from various model families. Our results indicate an evident over-reliance on parametric knowledge for approximately 20% of all queries, especially among Yes-No and action-related problems. Based on these findings, we evaluate the effectiveness of existing approaches to mitigating the conflicts and compare them to our "Focus-on-Vision" prompting strategy. Despite some improvement, the vision-knowledge conflict remains unresolved and can be further scaled through our data construction framework. Our proposed framework, benchmark, and analysis contribute to the understanding and mitigation of vision-knowledge conflicts in MLLMs.
Problem

Research questions and friction points this paper is trying to address.

Investigates commonsense-level vision-knowledge conflicts in Multimodal LLMs
Develops an automated, human-in-the-loop framework to simulate and evaluate such conflicts
Assesses the conflict-resolution capabilities of nine MLLMs using a diagnostic benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated data-generation framework with human-in-the-loop quality control
Diagnostic benchmark of 374 images and 1,122 QA pairs
"Focus-on-Vision" prompting strategy for conflict mitigation
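The paper does not reproduce its exact prompt wording here, but the "Focus-on-Vision" idea is to instruct the model to prioritize what the image shows over its parametric commonsense knowledge. A minimal sketch of such a prompt builder, with illustrative (not official) wording and helper names:

```python
# Hypothetical sketch of "Focus-on-Vision" prompting: prepend an instruction
# telling the MLLM to trust the image over its prior knowledge. The prefix
# text and function names below are assumptions for illustration only.

FOCUS_ON_VISION_PREFIX = (
    "Answer based solely on what is visible in the image, "
    "even if it contradicts common sense."
)

def build_prompt(question: str, focus_on_vision: bool = True) -> str:
    """Compose the text prompt sent to the MLLM alongside the image."""
    if focus_on_vision:
        return f"{FOCUS_ON_VISION_PREFIX}\n\nQuestion: {question}"
    return f"Question: {question}"

# Example: a conflict-style Yes-No query where the image depicts something
# counter-intuitive (e.g., a cat chasing a dog).
prompt = build_prompt("Is the cat chasing the dog? Answer Yes or No.")
print(prompt)
```

Comparing model answers with and without the prefix is one way to measure over-reliance on parametric knowledge, in the spirit of the benchmark's evaluation.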
👥 Authors
Xiaoyuan Liu
School of Data Science, The Chinese University of Hong Kong, Shenzhen, China; Tencent AI Lab, China
Wenxuan Wang
Tencent AI Lab, China; The Chinese University of Hong Kong, Hong Kong SAR
Youliang Yuan
School of Data Science, The Chinese University of Hong Kong, Shenzhen, China; Tencent AI Lab, China
Jen-Tse Huang
Johns Hopkins University
Artificial Intelligence · Natural Language Processing · Large Language Models
Qiuzhi Liu
AI Lab, Tencent
Pinjia He
Assistant Professor, The Chinese University of Hong Kong, Shenzhen
Software Engineering · AI4SE · SE4AI · AIOps
Zhaopeng Tu
Tech Lead @ Tencent Digital Human
Digital Human · Agents · Large Language Models · Machine Translation