🤖 AI Summary
This work addresses commonsense-level vision–knowledge conflict in multimodal large language models (MLLMs), the scenario where image content contradicts the model's parametric commonsense knowledge. The authors introduce VKBench, the first fine-grained diagnostic benchmark of its kind, comprising 374 images and 1,122 high-quality QA pairs, and use it to systematically evaluate nine state-of-the-art MLLMs. To construct conflict-rich data, they propose an automated generation framework with human-in-the-loop quality control, and they introduce a "Focus-on-Vision" prompting strategy for mitigation. Experiments reveal that models over-rely on parametric knowledge in roughly 20% of queries, particularly in yes/no classification and action-reasoning questions, leading to erroneous predictions. While the proposed methods partially mitigate such conflicts, a fundamental resolution remains open. This is the first work to formally model, diagnose, and intervene in commonsense-level vision–knowledge conflicts, establishing a benchmark and methodology for improving the reliability and trustworthiness of MLLM reasoning.
📝 Abstract
This paper explores the problem of commonsense-level vision-knowledge conflict in Multimodal Large Language Models (MLLMs), where visual information contradicts a model's internal commonsense knowledge. To study this issue, we introduce an automated framework, augmented with human-in-the-loop quality control, to generate inputs designed to simulate and evaluate these conflicts in MLLMs. Using this framework, we have crafted a diagnostic benchmark consisting of 374 original images and 1,122 high-quality question-answer (QA) pairs. The benchmark covers two aspects of conflict and three question types, providing a thorough assessment tool. We apply this benchmark to assess the conflict-resolution capabilities of nine representative MLLMs from various model families. Our results indicate an evident over-reliance on parametric knowledge for approximately 20% of all queries, especially among Yes-No and action-related questions. Based on these findings, we evaluate the effectiveness of existing approaches to mitigating the conflicts and compare them with our "Focus-on-Vision" prompting strategy. Despite some improvement, the vision-knowledge conflict remains unresolved, and the benchmark can be further scaled through our data construction framework. Our proposed framework, benchmark, and analysis contribute to the understanding and mitigation of vision-knowledge conflicts in MLLMs.