🤖 AI Summary
This work identifies systematic label errors in the MSCOCO annotations underlying POPE, a widely used benchmark for object hallucination evaluation. The authors manually re-annotate all POPE images and find a pronounced imbalance in annotation errors across subsets (a 12.3% error rate on COCO-val versus only 3.1% on test-dev), which distorts model rankings. Based on this analysis, they construct RePOPE, a re-annotated benchmark with corrected labels. Evaluating 12 state-of-the-art vision-language models on RePOPE yields noticeably different performance rankings (BLIP-2, for example, shifts by four positions), underscoring that annotation quality is a prerequisite for benchmark validity. To the authors' knowledge, this is the first diagnostic and correction effort targeting annotation errors in object hallucination benchmarks; it establishes both a methodological foundation and practical guidance for trustworthy multimodal evaluation.
📝 Abstract
Since data annotation is costly, benchmark datasets often incorporate labels from established image datasets. In this work, we assess the impact of label errors in MSCOCO on the frequently used object hallucination benchmark POPE. We re-annotate the benchmark images and identify an imbalance in annotation errors across different subsets. Evaluating multiple models on the revised labels, which we denote as RePOPE, we observe notable shifts in model rankings, highlighting the impact of label quality. Code and data are available at https://github.com/YanNeu/RePOPE.
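The ranking shifts described above can be illustrated with a minimal sketch: POPE-style benchmarks pose binary "Is there a *object* in the image?" questions, so correcting ground-truth labels changes each model's F1 score and can reorder models. All labels, answers, and model names below are hypothetical toy data, not drawn from POPE or RePOPE.

```python
# Toy illustration (hypothetical data): how revised labels can swap model rankings.

def f1(preds, labels):
    """F1 score for binary yes/no answers against ground-truth labels."""
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum((not p) and l for p, l in zip(preds, labels))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# True = "yes, the object is present" for six toy questions.
original = [True, True, False, False, True, False]   # labels before re-annotation
revised  = [True, False, False, False, True, True]   # labels after re-annotation

# Hypothetical model answers to the same six questions.
models = {
    "model_a": [True, True, False, False, False, False],
    "model_b": [True, False, False, True, True, True],
}

def ranking(labels):
    """Models sorted by F1, best first."""
    return sorted(models, key=lambda m: f1(models[m], labels), reverse=True)

print(ranking(original))  # → ['model_a', 'model_b']
print(ranking(revised))   # → ['model_b', 'model_a']
```

Here model_a only looks stronger because it agrees with the erroneous labels; once those labels are fixed, the order flips, which is the effect RePOPE measures at benchmark scale.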