🤖 AI Summary
Existing vision-language models (VLMs) excel at complex visual understanding, but their uncertainty quantification has not been systematically evaluated. Method: We introduce a large-scale multimodal uncertainty benchmark covering 16 open- and closed-source VLMs across six challenging datasets, including scientific understanding and mathematical reasoning tasks, and employ conformal prediction with three distinct scoring functions for calibration and evaluation. Contribution/Results: Experiments show that larger models consistently quantify uncertainty better, that more certain models achieve higher accuracy, and that mathematical and reasoning tasks yield poorer uncertainty performance than other domains across all models. The results indicate that models that know more also know better what they do not know. This work establishes a practical benchmark and a foundation for reliable uncertainty evaluation in multimodal systems.
📝 Abstract
Vision-Language Models (VLMs) have achieved remarkable progress in complex visual understanding across scientific and reasoning tasks. While performance benchmarking has advanced our understanding of these capabilities, the critical dimension of uncertainty quantification has received insufficient attention. Unlike prior conformal prediction studies that focused on limited settings, we conduct a comprehensive uncertainty benchmarking study, evaluating 16 state-of-the-art VLMs (open- and closed-source) across 6 multimodal datasets with 3 distinct scoring functions. Our findings demonstrate that larger models consistently exhibit better uncertainty quantification; models that know more also know better what they don't know. More certain models achieve higher accuracy, while mathematical and reasoning tasks elicit poorer uncertainty performance across all models compared to other domains. This work establishes a foundation for reliable uncertainty evaluation in multimodal systems.
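To make the evaluation setup concrete, the following is a minimal sketch of split conformal prediction for a multiple-choice task. The abstract does not name its three scoring functions, so the sketch assumes one common choice, the LAC score (one minus the probability the model assigns to the true option). The function names, the synthetic data, and the error rate `alpha = 0.1` are illustrative assumptions, not the paper's implementation.

```python
import math
import random


def lac_score(probs, label):
    """Nonconformity score (assumed LAC): 1 minus the probability of `label`."""
    return 1.0 - probs[label]


def conformal_threshold(cal_probs, cal_labels, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of the calibration scores."""
    scores = sorted(lac_score(p, y) for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # 1-indexed rank of the quantile
    if k > n:
        return float("inf")  # too few calibration points: keep every option
    return scores[k - 1]


def prediction_set(probs, qhat):
    """All options whose score does not exceed the calibrated threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= qhat]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def demo(n_cal=500, n_test=500, n_options=4, alpha=0.1, seed=0):
    """Synthetic stand-in for VLM outputs; returns (coverage, average set size)."""
    rng = random.Random(seed)

    def sample():
        # Hypothetical model output: random option probabilities, with the
        # true label drawn from those same probabilities.
        probs = softmax([rng.gauss(0.0, 2.0) for _ in range(n_options)])
        label = rng.choices(range(n_options), weights=probs)[0]
        return probs, label

    cal = [sample() for _ in range(n_cal)]
    test = [sample() for _ in range(n_test)]
    qhat = conformal_threshold([p for p, _ in cal], [y for _, y in cal], alpha)
    sets = [prediction_set(p, qhat) for p, _ in test]
    coverage = sum(y in s for s, (_, y) in zip(sets, test)) / n_test
    avg_size = sum(len(s) for s in sets) / n_test
    return coverage, avg_size


if __name__ == "__main__":
    coverage, avg_size = demo()
    print(f"coverage={coverage:.3f}  average set size={avg_size:.2f}")
```

In this kind of setup, coverage is the fraction of test questions whose prediction set contains the correct option, and the average set size serves as the uncertainty measure: at a fixed coverage level, a model that produces smaller sets is more certain.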