Seeing Isn't Believing: Uncovering Blind Spots in Evaluator Vision-Language Models

📅 2026-04-23
📈 Citations: 0
Influential: 0

🤖 AI Summary
This study examines blind spots in evaluator vision-language models (Evaluator VLMs), which are increasingly used to judge model outputs on image-to-text (I2T) and text-to-image (T2I) tasks. The authors apply targeted perturbations that degrade output quality along dimensions such as object hallucination, spatial reasoning, factual grounding, and visual fidelity, and assemble a benchmark of over 4,000 perturbed instances spanning 40 perturbation dimensions. They evaluate 4 prominent VLMs under three paradigms: single-answer scoring, pairwise comparison, and reference-guided evaluation. The evaluators often fail to detect perturbed outputs, with failure rates exceeding 50% in some cases; they struggle most with fine-grained compositional and spatial errors and with hallucinated content that contradicts the input image. Pairwise comparison is more reliable, but failures persist. The authors urge caution when relying on Evaluator VLMs for benchmarking and development decisions.

📝 Abstract
Large Vision-Language Models (VLMs) are increasingly used to evaluate the outputs of other models, both for image-to-text (I2T) tasks such as visual question answering and for text-to-image (T2I) generation tasks. Despite this growing reliance, the reliability of these Evaluator VLMs remains underexplored. In this work, we systematically evaluate the reliability of Evaluator VLMs across both I2T and T2I tasks. We introduce targeted perturbations that degrade output quality along key error dimensions, including object hallucinations, spatial reasoning, factual grounding, and visual fidelity. These perturbations test whether Evaluator VLMs can reliably account for these quality-degrading errors in their evaluations. Using a comprehensive benchmark of over 4,000 perturbed instances spanning 40 perturbation dimensions, we evaluate 4 prominent VLMs using single-answer scoring, pairwise comparison, and reference-guided paradigms. Our findings reveal that current VLM evaluators exhibit substantial blind spots: they often fail to detect perturbed outputs (with failure rates exceeding 50% in some cases), struggle particularly with fine-grained compositional and spatial errors, and are often insensitive to hallucinated content that contradicts the input image. Pairwise comparison proves more reliable, though failure rates persist. These results highlight the unreliable nature of current Evaluator VLMs and urge caution in their deployment for benchmarking and development decisions. Code and data have been made publicly available.
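To make the notion of a "failure to detect" concrete, the following is a minimal sketch of how such a rate could be computed. It is not the authors' code or their exact metric. It assumes that, under single-answer scoring, an evaluator misses a perturbation when it does not score the perturbed output strictly lower than the original, and that, under pairwise comparison, it misses when it does not prefer the original. The names `ScoredPair`, `failure_rate_by_perturbation`, and `pairwise_failure_rate`, the perturbation labels, and the toy scores are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ScoredPair:
    """Evaluator scores for one original output and its perturbed version."""
    perturbation: str       # hypothetical label, e.g. "spatial"
    original_score: float   # score the evaluator gave the original output
    perturbed_score: float  # score the evaluator gave the perturbed output


def failure_rate_by_perturbation(records: List[ScoredPair]) -> Dict[str, float]:
    """Single-answer scoring: fraction of pairs per perturbation type where
    the perturbed output was NOT scored strictly lower (assumed criterion)."""
    totals: Dict[str, int] = {}
    misses: Dict[str, int] = {}
    for r in records:
        totals[r.perturbation] = totals.get(r.perturbation, 0) + 1
        if r.perturbed_score >= r.original_score:
            misses[r.perturbation] = misses.get(r.perturbation, 0) + 1
    return {name: misses.get(name, 0) / n for name, n in totals.items()}


def pairwise_failure_rate(preferences: List[str]) -> float:
    """Pairwise comparison: fraction of verdicts that are not 'original'.
    Each verdict is 'original', 'perturbed', or 'tie' (assumed encoding)."""
    if not preferences:
        raise ValueError("preferences must not be empty")
    return sum(p != "original" for p in preferences) / len(preferences)


# Toy data for illustration only; these are not results from the paper.
records = [
    ScoredPair("spatial", 4.0, 4.0),        # missed: same score
    ScoredPair("spatial", 5.0, 3.0),        # detected: lower score
    ScoredPair("hallucination", 4.0, 2.0),  # detected: lower score
]
rates = failure_rate_by_perturbation(records)
pairwise_rate = pairwise_failure_rate(["original", "tie", "perturbed", "original"])
```

Counting ties as misses is a design choice of this sketch: a tie means the evaluator did not penalize the degraded output. A different convention would change the reported rates.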
Problem

Research questions and friction points this paper is trying to address.

Evaluator VLMs
reliability
blind spots
hallucination
spatial reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluator Vision-Language Models
targeted perturbations
reliability evaluation
hallucination detection
benchmarking