Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings

📅 2024-04-25
🏛️ arXiv.org
📈 Citations: 8
Influential: 0
🤖 AI Summary
Systematic validation of prompt–image alignment evaluation in text-to-image (T2I) generation is still lacking: existing automatic metrics have not been rigorously assessed for quality, reliability, or comparability against human judgements. Method: the paper proposes a comprehensive framework for alignment evaluation, combining a skills-based benchmark built on a skill-driven prompt taxonomy, a large-scale multi-template human evaluation dataset with over 100K annotations collected under a consistency-focused scoring protocol, and Gecko, a question-answering-based automatic metric designed to track human judgement. Contribution/Results: experiments show that the QA-based metric outperforms state-of-the-art metrics on both the new benchmark and TIFA160, and the analysis surfaces connections among prompt ambiguity, model bias, and metric bias, exposing limitations in current alignment-assessment practice.
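
As a rough illustration of the question-answering approach described above, the sketch below scores an image by generating checkable questions from the prompt and having a VQA model answer them. The helpers `generate_questions` and `answer_with_vqa` are hypothetical stand-ins for an LLM question generator and a VQA model; this is not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QAPair:
    question: str
    expected_answer: str  # the answer implied by the prompt, e.g. "yes" or "red"


def generate_questions(prompt: str) -> List[QAPair]:
    """Hypothetical: an LLM decomposes the prompt into checkable questions,
    e.g. "a red cube on a blue sphere" -> ("Is there a cube?", "yes"), ..."""
    raise NotImplementedError


def answer_with_vqa(image, question: str) -> str:
    """Hypothetical: a VQA model answers the question about the generated image."""
    raise NotImplementedError


def qa_alignment_score(prompt: str, image) -> float:
    """Alignment score = fraction of prompt-derived questions answered correctly."""
    qa_pairs = generate_questions(prompt)
    if not qa_pairs:
        return 0.0
    correct = sum(
        answer_with_vqa(image, qa.question).strip().lower()
        == qa.expected_answer.strip().lower()
        for qa in qa_pairs
    )
    return correct / len(qa_pairs)
```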

📝 Abstract
While text-to-image (T2I) generative models have become ubiquitous, they do not necessarily generate images that align with a given prompt. While previous work has evaluated T2I alignment by proposing metrics, benchmarks, and templates for collecting human judgements, the quality of these components is not systematically measured. Human-rated prompt sets are generally small and the reliability of the ratings -- and thereby the prompt set used to compare models -- is not evaluated. We address this gap by performing an extensive study evaluating auto-eval metrics and human templates. We provide three main contributions: (1) We introduce a comprehensive skills-based benchmark that can discriminate models across different human templates. This skills-based benchmark categorises prompts into sub-skills, allowing a practitioner to pinpoint not only which skills are challenging, but at what level of complexity a skill becomes challenging. (2) We gather human ratings across four templates and four T2I models for a total of >100K annotations. This allows us to understand where differences arise due to inherent ambiguity in the prompt and where they arise due to differences in metric and model quality. (3) Finally, we introduce a new QA-based auto-eval metric that is better correlated with human ratings than existing metrics for our new dataset, across different human templates, and on TIFA160.
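
Contribution (3) hinges on how well an auto-eval metric tracks human ratings under each template. Below is a minimal sketch of that comparison, assuming a simple per-prompt score layout; the data structures are illustrative, not the paper's.

```python
from scipy.stats import spearmanr


def per_template_correlation(human_ratings, metric_scores):
    """human_ratings: {template_name: {prompt_id: mean human score}}
    metric_scores: {prompt_id: auto-eval score}
    Returns the Spearman rank correlation between the metric and each
    human-rating template, computed over prompts present in both."""
    correlations = {}
    for template, ratings in human_ratings.items():
        shared = sorted(ratings.keys() & metric_scores.keys())
        rho, _ = spearmanr(
            [ratings[p] for p in shared],
            [metric_scores[p] for p in shared],
        )
        correlations[template] = rho
    return correlations
```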
Problem

Research questions and friction points this paper is trying to address.

Evaluate text-to-image model alignment
Assess reliability of human-rated prompts (see the agreement sketch after this list)
Introduce improved auto-evaluation metrics
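
One way to make the reliability question concrete is to measure how often annotators agree on the same prompt. The sketch below computes plain pairwise agreement; it is a simplification for illustration, not the paper's rating protocol, and the data layout is assumed.

```python
from itertools import combinations


def mean_pairwise_agreement(ratings_per_prompt):
    """ratings_per_prompt: {prompt_id: [rating from each annotator]}
    Returns the fraction of annotator pairs giving identical ratings,
    averaged over prompts with at least two annotators."""
    per_prompt = []
    for ratings in ratings_per_prompt.values():
        pairs = list(combinations(ratings, 2))
        if pairs:
            per_prompt.append(sum(a == b for a, b in pairs) / len(pairs))
    return sum(per_prompt) / len(per_prompt) if per_prompt else float("nan")
```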
Innovation

Methods, ideas, or system contributions that make the work stand out.

Skills-based benchmark for T2I models (see the breakdown sketch after this list)
Large-scale analysis of human annotations across templates and models
QA-based auto-eval metric improvement
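
The skills-based benchmark is meant to pinpoint not just which skills fail but at what complexity they fail. A minimal sketch of that breakdown, assuming per-prompt records tagged with a skill label, a complexity level, and an alignment score (field names are illustrative, not the benchmark's schema):

```python
from collections import defaultdict


def score_by_skill_and_complexity(records):
    """records: iterable of dicts such as
    {"skill": "counting", "complexity": 3, "score": 0.7}.
    Returns {(skill, complexity): mean alignment score}, which makes it easy
    to see the complexity level at which a skill starts to fail."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for r in records:
        key = (r["skill"], r["complexity"])
        totals[key] += r["score"]
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}
```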