A Good CREPE needs more than just Sugar: Investigating Biases in Compositional Vision-Language Benchmarks

📅 2025-06-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper identifies systematic design biases in 17 mainstream vision-language compositional reasoning benchmarks (e.g., SugarCREPE, VALSE). The biases stem from asymmetric data distributions, both in the source datasets (e.g., MS-COCO) and in the construction of positive/negative samples, and they inflate model performance estimates: simple blind heuristics (e.g., token-length or log-likelihood baselines) achieve performance on par with state-of-the-art models such as CLIP, overstating true compositional understanding. Method: systematic benchmark analysis, statistical bias detection, and cross-benchmark consistency validation to diagnose distributional imbalance at the benchmark level. Contribution/Results: the paper is the first to formally identify benchmark-level distributional skew as the root cause of brittle evaluation, exposes widespread vulnerability to blind heuristic attacks across existing benchmarks, and proposes principled guidelines and actionable reconstruction protocols for building more robust, trustworthy vision-language model evaluation frameworks.

📝 Abstract
We investigate 17 benchmarks (e.g. SugarCREPE, VALSE) commonly used for measuring compositional understanding capabilities of vision-language models (VLMs). We scrutinize design choices in their construction, including data source (e.g. MS-COCO) and curation procedures (e.g. constructing negative images/captions), uncovering several inherent biases across most benchmarks. We find that blind heuristics (e.g. token-length, log-likelihood under a language model) perform on par with CLIP models, indicating that these benchmarks do not effectively measure compositional understanding. We demonstrate that the underlying factor is a distribution asymmetry between positive and negative images/captions, induced by the benchmark construction procedures. To mitigate these issues, we provide a few key recommendations for constructing more robust vision-language compositional understanding benchmarks that would be less prone to such simple attacks.
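The blind heuristics mentioned above are strikingly simple. As a minimal illustrative sketch (the function name and toy captions below are ours, not from the paper), a token-length baseline just prefers the longer of two candidate captions, ignoring the image entirely:

```python
# Illustrative sketch of a "blind" token-length heuristic of the kind the
# paper reports matching CLIP on biased benchmarks. Given a pair of
# candidate captions, it picks the longer one -- the image is never used.
# The example captions are hypothetical, not taken from any benchmark.

def token_length_heuristic(caption_a: str, caption_b: str) -> str:
    """Return the caption with more whitespace-separated tokens."""
    if len(caption_a.split()) >= len(caption_b.split()):
        return caption_a
    return caption_b

# Toy pair mimicking a hard-negative edit that happens to shorten the caption
positive = "a brown dog chasing a red ball in the park"
negative = "a red dog chasing a brown ball"
print(token_length_heuristic(positive, negative))  # picks the positive caption
```

If a benchmark's negatives are systematically shorter (or less likely under a language model) than its positives, such an image-free rule scores well, which is exactly the distribution asymmetry the paper diagnoses.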
Problem

Research questions and friction points this paper is trying to address.

Investigating biases in the design of vision-language benchmarks
Explaining why existing benchmarks fail to effectively measure compositional understanding in VLMs
Addressing distribution asymmetry introduced by benchmark construction procedures
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzing biases in vision-language benchmarks
Identifying distribution asymmetry in data
Recommending robust benchmark construction methods