GRADE: Quantifying Sample Diversity in Text-to-Image Models

📅 2024-10-29
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited output diversity of text-to-image (T2I) models. We propose GRADE, the first framework integrating large language models (LLMs) and visual question answering (VQA) to automatically identify and quantify semantic diversity across concept-level attributes (e.g., shape, color) in generated images. GRADE leverages LLMs’ world knowledge to extract concept–attribute pairs, constructs their frequency distributions, and introduces normalized entropy as a semantics-driven diversity metric—achieving >90% agreement with human evaluations. Evaluation across 12 mainstream T2I models reveals severe homogenization (e.g., 98% of “cookie” generations are circular) and identifies underspecified training captions as the primary cause of low diversity. GRADE establishes the first interpretable, scalable, and highly consistent automated benchmark for assessing semantic diversity in T2I generation.

📝 Abstract
Text-to-image (T2I) models are remarkable at generating realistic images from textual descriptions. However, textual prompts are inherently underspecified: they do not fix every attribute of the requested image. This raises two key questions: Do T2I models generate diverse outputs for underspecified prompts? And how can we measure that diversity automatically? We propose GRADE: Granular Attribute Diversity Evaluation, an automatic method for quantifying sample diversity. GRADE leverages the world knowledge embedded in large language models and visual question-answering systems to identify relevant concept-specific axes of diversity (e.g., "shape" and "color" for the concept "cookie"). It then estimates frequency distributions over concepts and their attributes and quantifies diversity using (normalized) entropy. GRADE achieves over 90% human agreement while correlating only weakly with commonly used diversity metrics. We use GRADE to measure the overall diversity of 12 T2I models across 400 concept-attribute pairs, revealing that all models display limited variation. Further, we find that these models often exhibit default behaviors, a phenomenon where a model consistently generates a concept with the same attribute (e.g., 98% of the cookies are round). Finally, we show that a key reason for low diversity is underspecified captions in training data. Our work proposes a modern, semantically driven approach to measuring sample diversity and highlights the striking homogeneity in T2I model outputs.
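The normalized-entropy metric described above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation: it assumes each generated image has already been assigned an attribute value (e.g., via VQA) and normalizes Shannon entropy by the log of the number of *observed* attribute values; the paper may instead normalize by the number of candidate values proposed by the LLM.

```python
import math
from collections import Counter

def normalized_entropy(attribute_values):
    """Diversity score in [0, 1] for a list of attribute labels
    (one per generated image): 1.0 means a uniform spread over
    observed values, 0.0 means every image shares one value.

    Note: normalizing by log(len(counts)) (observed values) is an
    assumption; an alternative is log(number of candidate values).
    """
    counts = Counter(attribute_values)
    if len(counts) <= 1:
        return 0.0  # a single attribute value carries no diversity
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))
```

For example, a "cookie" batch with 98 round and 2 square shapes scores about 0.14, matching the paper's picture of severe homogenization, while an even four-way split over shapes scores 1.0.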
Problem

Research questions and friction points this paper is trying to address.

Quantify sample diversity in text-to-image models
Identify concept-specific axes of diversity using large language models
Measure diversity and reveal limited variation in model outputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages large language models for diversity quantification
Uses entropy to measure concept and attribute distributions
Identifies underspecified captions as a diversity limitation