AI Summary
This work addresses a critical gap in the evaluation of vision-language models (VLMs): their inability to abstain from answering when image and text evidence conflict or the required knowledge is missing, even though abstention is essential for reliable deployment. To overcome the limitations of static benchmarks, which rapidly become outdated as training data scales, the authors propose a dynamic benchmark construction methodology. Leveraging multimodal retrieval and dependency-aware sample filtering followed by dynamic cleansing, they introduce VLM-DeflectionBench, comprising 2,775 carefully curated instances. A fine-grained evaluation protocol across four distinct scenarios is designed to systematically assess a model's deferral behavior under evidential insufficiency or contradiction. Experiments on 20 prominent VLMs reveal that most struggle to abstain appropriately, exposing significant reliability shortcomings and advancing the development of robust, knowledge-aware visual question answering evaluation frameworks.
Abstract
Large Vision-Language Models (LVLMs) increasingly rely on retrieval to answer knowledge-intensive multimodal questions. Existing benchmarks overlook conflicts between visual and textual evidence and the importance of generating deflections (e.g., Sorry, I cannot answer...) when retrieved knowledge is incomplete. These benchmarks also suffer from rapid obsolescence, as growing LVLM training sets allow models to answer many questions without retrieval. We address these gaps with three contributions. First, we propose a dynamic data curation pipeline that preserves benchmark difficulty over time by filtering for genuinely retrieval-dependent samples. Second, we introduce VLM-DeflectionBench, a benchmark of 2,775 samples spanning diverse multimodal retrieval settings, designed to probe model behaviour under conflicting or insufficient evidence. Third, we define a fine-grained evaluation protocol with four scenarios that disentangle parametric memorization from retrieval robustness. Experiments across 20 state-of-the-art LVLMs indicate that models usually fail to deflect in the presence of noisy or misleading evidence. Our results highlight the need to evaluate not only what models know, but how they behave when they do not, and serve as a reusable and extensible benchmark for reliable KB-VQA evaluation. All resources will be publicly available upon publication.
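To make the idea of "filtering for genuinely retrieval-dependent samples" concrete, the following is a minimal sketch and not the authors' pipeline. It assumes that a sample counts as retrieval-dependent when no reference model answers it correctly without retrieval. The names `Sample`, `filter_retrieval_dependent`, `is_correct`, and `is_deflection`, the substring-based correctness check, and the list of deflection phrases are all hypothetical and chosen for illustration. Images are omitted for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass(frozen=True)
class Sample:
    question: str  # the image is omitted in this text-only sketch
    answer: str    # gold answer

# Hypothetical closed-book model: maps a question to a free-text answer.
ClosedBookModel = Callable[[str], str]

def is_correct(prediction: str, gold: str) -> bool:
    """Lenient match (assumption): the gold answer appears in the prediction."""
    return gold.strip().lower() in prediction.strip().lower()

def filter_retrieval_dependent(
    samples: Iterable[Sample],
    models: Iterable[ClosedBookModel],
    max_closed_book_correct: int = 0,
) -> List[Sample]:
    """Keep samples that at most `max_closed_book_correct` models solve
    without any retrieved evidence (assumed criterion, not from the paper)."""
    models = list(models)
    kept = []
    for sample in samples:
        n_correct = sum(
            is_correct(model(sample.question), sample.answer) for model in models
        )
        if n_correct <= max_closed_book_correct:
            kept.append(sample)
    return kept

# Illustrative phrases only; a real protocol would need a more robust detector.
DEFLECTION_MARKERS = ("sorry, i cannot answer", "i cannot answer", "i don't know")

def is_deflection(response: str) -> bool:
    """Return True if the response looks like a refusal to answer."""
    text = response.lower()
    return any(marker in text for marker in DEFLECTION_MARKERS)

if __name__ == "__main__":
    samples = [
        Sample("What is the capital of France?", "Paris"),
        Sample("Who designed the building shown in the image?", "Jane Doe"),
    ]
    # Toy stand-in for a model that knows only one fact from its parameters.
    toy_model = lambda q: "Paris" if "France" in q else "Sorry, I cannot answer that."
    for s in filter_retrieval_dependent(samples, [toy_model]):
        print(s.question)
```

Re-running such a filter against newer models is one way a benchmark could stay difficult over time: questions that models have since memorized are dropped, and only those that still require retrieval remain.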