CROC: Evaluating and Training T2I Metrics with Pseudo- and Human-Labeled Contrastive Robustness Checks

📅 2025-05-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing text-to-image (T2I) evaluation metrics lack systematic robustness validation: human meta-evaluation is costly and time-intensive, and automated alternatives are scarce. Method: CROC, a scalable framework for automated Contrastive Robustness Checks that synthesizes contrastive test cases across a comprehensive taxonomy of image properties, yielding a pseudo-labeled dataset of over one million contrastive prompt-image pairs (CROCˢʸⁿ) and a human-supervised challenge benchmark (CROCʰᵘᵐ) for especially difficult categories. Contribution/Results: The analysis exposes systematic robustness failures in existing metrics, notably on negation handling, and all tested open-source metrics fail on at least 25% of cases requiring correct identification of body parts. Leveraging these insights, the authors train CROCScore on CROCˢʸⁿ, a new metric that achieves state-of-the-art performance among open-source T2I evaluators.

📝 Abstract
The assessment of evaluation metrics (meta-evaluation) is crucial for determining the suitability of existing metrics in text-to-image (T2I) generation tasks. Human-based meta-evaluation is costly and time-intensive, and automated alternatives are scarce. We address this gap and propose CROC: a scalable framework for automated Contrastive Robustness Checks that systematically probes and quantifies metric robustness by synthesizing contrastive test cases across a comprehensive taxonomy of image properties. With CROC, we generate a pseudo-labeled dataset (CROCˢʸⁿ) of over one million contrastive prompt-image pairs to enable a fine-grained comparison of evaluation metrics. We also use the dataset to train CROCScore, a new metric that achieves state-of-the-art performance among open-source methods, demonstrating an additional key application of our framework. To complement this dataset, we introduce a human-supervised benchmark (CROCʰᵘᵐ) targeting especially challenging categories. Our results highlight robustness issues in existing metrics: for example, many fail on prompts involving negation, and all tested open-source metrics fail on at least 25% of cases involving correct identification of body parts.
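The core idea of a contrastive robustness check can be sketched as follows: a metric passes a test case when it scores the prompt paired with its matching image higher than the same prompt paired with a contrastive (mismatched) image. This is a minimal illustration, not the paper's implementation; `toy_metric` is a hypothetical word-overlap scorer standing in for a real T2I metric, with caption strings standing in for images.

```python
# Hedged sketch of a CROC-style contrastive robustness check.
# A metric passes a test case when it ranks the matching prompt-image
# pair strictly above the contrastive one; the failure rate is the
# fraction of cases where it does not.
from typing import Callable, Iterable, Tuple


def failure_rate(
    metric: Callable[[str, str], float],
    cases: Iterable[Tuple[str, str, str]],  # (prompt, matching, contrastive)
) -> float:
    """Fraction of contrastive cases the metric gets wrong."""
    cases = list(cases)
    failures = sum(
        1
        for prompt, pos, neg in cases
        if metric(prompt, pos) <= metric(prompt, neg)
    )
    return failures / len(cases) if cases else 0.0


def toy_metric(prompt: str, image_caption: str) -> float:
    # Hypothetical stand-in metric: counts words shared between the
    # prompt and an image "caption" (captions replace images here).
    return len(set(prompt.lower().split()) & set(image_caption.lower().split()))


cases = [
    # Negation case: naive overlap rewards the *wrong* image.
    ("a dog with no collar", "dog without collar", "dog wearing a collar"),
    # Attribute case: overlap correctly prefers the matching image.
    ("a red cube on a table", "red cube on table", "blue cube on table"),
]

print(failure_rate(toy_metric, cases))  # 0.5: fails the negation case
```

The toy metric fails exactly on the negation case, mirroring the kind of systematic weakness the paper reports for real metrics.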
Problem

Research questions and friction points this paper is trying to address.

Assessing text-to-image metric robustness via automated contrastive checks
Generating pseudo-labeled dataset for fine-grained metric comparison
Training a new state-of-the-art evaluation metric (CROCScore)
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated Contrastive Robustness Checks (CROC) framework for metric meta-evaluation
Million-scale pseudo-labeled contrastive dataset (CROCˢʸⁿ) plus human-supervised benchmark (CROCʰᵘᵐ)
CROCScore, a metric trained on CROCˢʸⁿ, achieving state-of-the-art performance among open-source methods