VLM-UQBench: A Benchmark for Modality-Specific and Cross-Modality Uncertainties in Vision Language Models

📅 2026-02-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current vision-language models (VLMs) lack fine-grained quantification of uncertainty arising from visual, textual, and cross-modal sources, which limits their ability to reliably detect hallucinations and localized risks. To address this gap, this work introduces VLM-UQBench, a benchmark that systematically categorizes and annotates modality-specific and cross-modal uncertainty samples. Built upon VizWiz, it includes a fine-grained subset of 600 real-world instances and incorporates eight visual, five textual, and three cross-modal perturbation strategies. The authors propose two metrics that evaluate the sensitivity of uncertainty scores to these perturbations and their correlation with hallucination. Experiments reveal that existing uncertainty quantification methods depend heavily on the underlying VLM: while they capture overt ambiguity at the group level, they largely fail to detect subtle instance-level ambiguity and provide only weak, inconsistent signals of hallucination risk.

📝 Abstract

Uncertainty quantification (UQ) is vital for ensuring that vision-language models (VLMs) behave safely and reliably. A central challenge is to localize uncertainty to its source, determining whether it arises from the image, the text, or misalignment between the two. We introduce VLM-UQBench, a benchmark for modality-specific and cross-modal data uncertainty in VLMs. It consists of 600 real-world samples drawn from the VizWiz dataset, curated into clean, image-, text-, and cross-modal uncertainty subsets, and a scalable perturbation pipeline with 8 visual, 5 textual, and 3 cross-modal perturbations. We further propose two simple metrics that quantify the sensitivity of UQ scores to these perturbations and their correlation with hallucinations, and use them to evaluate a range of UQ methods across four VLMs and three datasets. Empirically, we find that: (i) existing UQ methods exhibit strong modality-specific specialization and substantial dependence on the underlying VLM, (ii) modality-specific uncertainty frequently co-occurs with hallucinations while current UQ scores provide only weak and inconsistent risk signals, and (iii) although UQ methods can rival reasoning-based chain-of-thought baselines on overt, group-level ambiguity, they largely fail to detect the subtle, instance-level ambiguity introduced by our perturbation pipeline. These results highlight a significant gap between current UQ practices and the fine-grained, modality-aware uncertainty required for reliable VLM deployment.
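
The abstract does not spell out the two metrics, so the following is only an illustrative sketch of what a perturbation-sensitivity score and a hallucination-correlation score could look like, not the paper's definitions. The function names, the "fraction of samples whose score rises" formulation, and the use of AUROC as the correlation measure are all assumptions made for this example, as are the toy numbers.

```python
from statistics import mean

# Hypothetical stand-ins for the paper's two metrics; the actual
# definitions are not given in the abstract.

def perturbation_sensitivity(clean_scores, perturbed_scores):
    """Fraction of paired samples whose UQ score rises after perturbation."""
    if len(clean_scores) != len(perturbed_scores) or not clean_scores:
        raise ValueError("need two non-empty, equal-length score lists")
    return mean(
        1.0 if p > c else 0.0
        for c, p in zip(clean_scores, perturbed_scores)
    )

def hallucination_auroc(scores, hallucinated):
    """Probability that a hallucinated sample receives a higher UQ score
    than a non-hallucinated one (ties count as 0.5)."""
    pos = [s for s, h in zip(scores, hallucinated) if h]
    neg = [s for s, h in zip(scores, hallucinated) if not h]
    if not pos or not neg:
        raise ValueError("need both hallucinated and non-hallucinated samples")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Toy data (invented): UQ scores before and after a perturbation,
# plus hallucination labels for the perturbed answers.
clean = [0.2, 0.5, 0.1, 0.4]
perturbed = [0.6, 0.4, 0.3, 0.9]
labels = [True, False, False, True]

sensitivity = perturbation_sensitivity(clean, perturbed)
auroc = hallucination_auroc(perturbed, labels)
print(sensitivity, auroc)
```

Under this reading, a sensitivity near 1 would mean the UQ method reacts to the perturbation on almost every sample, and an AUROC near 0.5 would correspond to the "weak and inconsistent risk signals" the abstract describes.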

Problem

Research questions and friction points this paper is trying to address.

uncertainty quantification

vision-language models

modality-specific uncertainty

cross-modal uncertainty

hallucination

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uncertainty Quantification

Vision-Language Models

Modality-Specific Uncertainty