VLM Judges Can Rank but Cannot Score: Task-Dependent Uncertainty in Multimodal Evaluation

📅 2026-04-28

🤖 AI Summary

Vision-language models (VLMs) used as multimodal evaluators give no reliable indication of confidence, so the certainty of their scores is hard to quantify. This work presents the first systematic analysis of conformal prediction for VLM-based evaluation, a training-free approach that converts point-estimate scores into calibrated prediction intervals using the log-probabilities of scoring tokens. The study finds that uncertainty is task-dependent and identifies a "ranking-scoring decoupling" phenomenon, in which high ranking correlation coexists with wide, uninformative absolute score intervals. Experiments across 3 judges and 14 visual task categories yield a reliability map for multimodal evaluation: prediction intervals cover roughly 40% of the score range for aesthetics and natural images but widen to roughly 70% for chart and mathematical reasoning. On a clean, multi-annotator captioning benchmark, the same judge and method produce intervals that are 4.5× narrower.

📝 Abstract

Vision-language models (VLMs) are increasingly used as automated judges for multimodal systems, yet their scores provide no indication of reliability. We study this problem through conformal prediction, a distribution-free framework that converts a judge's point score into a calibrated prediction interval using only score-token log-probabilities, with no retraining. We present the first systematic analysis of conformal prediction for VLM-as-a-Judge across 3 judges and 14 visual task categories. Our results show that evaluation uncertainty is strongly task-dependent: intervals cover ~40% of the score range for aesthetics and natural images but expand to ~70% for chart and mathematical reasoning, yielding a quantitative reliability map for multimodal evaluation. We further identify a failure mode not captured by standard evaluation metrics, ranking-scoring decoupling, where judges achieve high ranking correlation while producing wide, uninformative intervals: they order responses correctly but fail to assign reliable absolute scores. Finally, we show that interval width is driven primarily by task difficulty and annotation quality: the same judge and method yield 4.5× narrower intervals on a clean, multi-annotator captioning benchmark. Code: https://github.com/divake/VLM-Judge-Uncertainty
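To make the idea of turning a point score into a calibrated interval concrete, the following is a minimal sketch of standard split conformal prediction with absolute-residual nonconformity scores. It is not taken from the linked repository. The function names, the 1 to 5 score scale, the use of a probability-weighted expected score, and the symmetric interval are all assumptions made for illustration, and the paper's actual procedure may differ.

```python
import math


def expected_score(logprobs):
    """Collapse log-probabilities over score tokens into one expected score.

    `logprobs` maps a numeric score (e.g. 1..5) to the judge's log-probability
    for that score token. This mapping is a hypothetical input format.
    """
    # Renormalise over the score tokens only (a softmax), subtracting the max
    # log-probability for numerical stability.
    m = max(logprobs.values())
    weights = {s: math.exp(lp - m) for s, lp in logprobs.items()}
    total = sum(weights.values())
    return sum(s * w for s, w in weights.items()) / total


def conformal_quantile(residuals, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration residuals.

    `residuals` are |human score - judge score| on a held-out calibration set.
    """
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for the requested coverage level.
        return math.inf
    return sorted(residuals)[k - 1]


def prediction_interval(point, q, lo=1.0, hi=5.0):
    """Symmetric interval around the point score, clipped to the score scale."""
    return max(lo, point - q), min(hi, point + q)


def relative_width(interval, lo=1.0, hi=5.0):
    """Interval width as a fraction of the full score range."""
    return (interval[1] - interval[0]) / (hi - lo)
```

Under these assumptions, `relative_width` is the kind of quantity behind statements such as "intervals cover ~40% of the score range": a value near 1.0 means the interval spans almost the whole scale and says little about the absolute score, even if the point scores still rank responses well.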

Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models

Multimodal Evaluation

Conformal Prediction

Evaluation Uncertainty

Automated Judging

Innovation

Methods, ideas, or system contributions that make the work stand out.

Conformal Prediction

Vision-Language Models

Multimodal Evaluation