The Art of Saying "Maybe": A Conformal Lens for Uncertainty Benchmarking in VLMs

📅 2025-09-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language models (VLMs) excel at complex visual understanding, yet systematic evaluation of their uncertainty quantification has been lacking. Method: We introduce the first large-scale multimodal uncertainty benchmark covering both open- and closed-source VLMs, spanning six challenging datasets (including scientific understanding and mathematical reasoning tasks), and employ conformal prediction with three distinct scoring functions for calibration and evaluation. Contribution/Results: Experiments reveal a strong positive correlation between model scale and uncertainty-estimation quality; higher-confidence predictions are more accurate; and mathematical and reasoning tasks consistently yield higher uncertainty. We propose a reliability assessment framework tailored to complex visual reasoning, demonstrating that a model's factual knowledge strongly correlates with its ability to calibrate its epistemic boundaries. This work establishes both theoretical foundations and practical benchmarks for trustworthy multimodal AI.
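The split-conformal recipe underlying this kind of benchmark is short enough to sketch. A minimal example in Python, assuming the common LAC nonconformity score `1 - p_y` (the summary does not name the paper's three scoring functions, so the score choice, function names, and toy data below are illustrative assumptions, not the authors' code):

```python
import numpy as np

def conformal_calibrate(cal_probs, cal_labels, alpha=0.1):
    """Split-conformal calibration with the LAC score s(x, y) = 1 - p_y.

    cal_probs: (n, K) softmax probabilities on a held-out calibration set.
    cal_labels: (n,) integer ground-truth labels.
    Returns a threshold qhat so that sets {y : 1 - p_y <= qhat} contain
    the true label with probability >= 1 - alpha (marginally).
    """
    n = len(cal_labels)
    # Nonconformity score: one minus the probability of the true class.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample-corrected quantile level.
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(scores, q_level, method="higher")

def conformal_predict(test_probs, qhat):
    """Boolean prediction sets: include class y iff 1 - p_y <= qhat."""
    return 1.0 - test_probs <= qhat

# Toy 4-way classifier whose true class is made more likely.
rng = np.random.default_rng(0)
n, K = 1000, 4
logits = rng.normal(size=(n, K))
labels = rng.integers(0, K, size=n)
logits[np.arange(n), labels] += 2.0
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

qhat = conformal_calibrate(probs[:500], labels[:500], alpha=0.1)
sets = conformal_predict(probs[500:], qhat)
coverage = sets[np.arange(500), labels[500:]].mean()
```

In a benchmarking setting like the paper's, the quantities of interest are then the empirical coverage and the average prediction-set size: a well-calibrated model reaches the target coverage with small sets, while an uncertain one needs large sets to cover the truth.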

📝 Abstract
Vision-Language Models (VLMs) have achieved remarkable progress in complex visual understanding across scientific and reasoning tasks. While performance benchmarking has advanced our understanding of these capabilities, the critical dimension of uncertainty quantification has received insufficient attention. Unlike prior conformal prediction studies that focused on limited settings, we conduct a comprehensive uncertainty benchmarking study, evaluating 16 state-of-the-art VLMs (open- and closed-source) across 6 multimodal datasets with 3 distinct scoring functions. Our findings demonstrate that larger models consistently exhibit better uncertainty quantification: models that know more also know better what they don't know. More certain models achieve higher accuracy, while mathematical and reasoning tasks elicit poorer uncertainty performance across all models compared to other domains. This work establishes a foundation for reliable uncertainty evaluation in multimodal systems.
Problem

Research questions and friction points this paper is trying to address.

Benchmarking uncertainty quantification in Vision-Language Models
Evaluating 16 VLMs across 6 multimodal datasets
Establishing foundation for reliable multimodal uncertainty evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comprehensive uncertainty benchmarking across VLMs
Evaluating 16 models with 3 scoring functions
Larger models show better uncertainty quantification
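To make "3 scoring functions" concrete: two standard conformal scores from the literature, LAC and APS, can each be written in a few lines. These are common choices in conformal prediction work, not necessarily the three functions this paper evaluates:

```python
import numpy as np

def lac_scores(probs, labels):
    """LAC score: one minus the probability assigned to the true class."""
    return 1.0 - probs[np.arange(len(labels)), labels]

def aps_scores(probs, labels):
    """APS score (adaptive prediction sets, without randomization):
    total probability mass of classes ranked at or above the true class."""
    order = np.argsort(-probs, axis=1)              # most likely class first
    sorted_p = np.take_along_axis(probs, order, axis=1)
    cumsum = np.cumsum(sorted_p, axis=1)
    # Position of the true label within the sorted order.
    rank = np.argmax(order == labels[:, None], axis=1)
    return cumsum[np.arange(len(labels)), rank]
```

The two scores trade off differently: LAC yields the smallest average sets but ignores the rest of the probability vector, while APS adapts set size to how spread out the model's probabilities are, which matters for the harder reasoning tasks where VLM confidence is diffuse.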