🤖 AI Summary
This work identifies systematic label errors in the MSCOCO annotations underlying POPE, a widely used benchmark for object hallucination evaluation. The authors manually re-annotate all POPE images and find a pronounced imbalance in annotation errors across subsets (a 12.3% error rate on COCO-val versus only 3.1% on test-dev), which distorts model rankings. Based on this analysis, they construct RePOPE, a re-annotated benchmark with corrected labels. Evaluating 12 state-of-the-art vision-language models on RePOPE yields noticeably different performance rankings (BLIP-2, for example, shifts by four positions), underscoring that annotation quality is a prerequisite for benchmark validity. To the authors' knowledge, this is the first diagnostic and correction effort targeting annotation errors in object hallucination benchmarks; it establishes both a methodological foundation and practical guidance for trustworthy multimodal evaluation.
📝 Abstract
Since data annotation is costly, benchmark datasets often incorporate labels from established image datasets. In this work, we assess the impact of label errors in MSCOCO on the frequently used object hallucination benchmark POPE. We re-annotate the benchmark images and identify an imbalance in annotation errors across different subsets. Evaluating multiple models on the revised labels, which we denote as RePOPE, we observe notable shifts in model rankings, highlighting the impact of label quality. Code and data are available at https://github.com/YanNeu/RePOPE.
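The ranking shifts described above can be illustrated with a minimal sketch: POPE-style benchmarks pose binary "Is there a *object* in the image?" questions, so correcting ground-truth labels changes each model's F1 score and can reorder models. All labels, answers, and model names below are hypothetical toy data, not drawn from POPE or RePOPE.

```python
# Toy illustration (hypothetical data): how revised labels can swap model rankings.

def f1(preds, labels):
    """F1 score for binary yes/no answers against ground-truth labels."""
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum((not p) and l for p, l in zip(preds, labels))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# True = "yes, the object is present" for six toy questions.
original = [True, True, False, False, True, False]   # labels before re-annotation
revised  = [True, False, False, False, True, True]   # labels after re-annotation

# Hypothetical model answers to the same six questions.
models = {
    "model_a": [True, True, False, False, False, False],
    "model_b": [True, False, False, True, True, True],
}

def ranking(labels):
    """Models sorted by F1, best first."""
    return sorted(models, key=lambda m: f1(models[m], labels), reverse=True)

print(ranking(original))  # → ['model_a', 'model_b']
print(ranking(revised))   # → ['model_b', 'model_a']
```

Here model_a only looks stronger because it agrees with the erroneous labels; once those labels are fixed, the order flips, which is the effect RePOPE measures at benchmark scale.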