🤖 AI Summary
Current text-to-image (T2I) models suffer from insufficient generative diversity and lack robust, interpretable evaluation methods. To address this, we propose the first systematic benchmark for concept-level diversity assessment. The benchmark employs attribute-controllable prompt sets and a standardized human evaluation protocol, quantifies inter-sample semantic divergence using multi-source image embeddings (e.g., CLIP, DINO), and applies binomial testing to produce statistically grounded diversity rankings across models. Our key innovation is evaluating diversity at the level of fine-grained semantic concepts, which exposes category-specific generation biases. The resulting reproducible benchmark effectively discriminates the diversity performance of leading models, including Stable Diffusion, SDXL, and DALL·E 3, providing empirical grounding and methodological guidance for model improvement and diversity metric design.
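To make the embedding-based measurement concrete, here is a minimal sketch of one plausible per-prompt diversity statistic: the mean pairwise cosine distance over embeddings of images generated from the same prompt. The summary does not specify the exact divergence measure, so this choice, along with the function name and the random stand-in embeddings, is an assumption for illustration only:

```python
import numpy as np

def mean_pairwise_cosine_distance(embeddings: np.ndarray) -> float:
    """Per-prompt diversity score: mean cosine distance over all image pairs.

    embeddings: (n_images, dim) array of image embeddings (e.g., CLIP or DINO).
    NOTE: the exact divergence statistic is an assumption; the source only
    states that inter-sample semantic divergence is quantified from embeddings.
    """
    # Unit-normalize rows so that dot products equal cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T                 # (n, n) cosine similarity matrix
    iu = np.triu_indices(len(embeddings), k=1)  # distinct unordered pairs
    return float(np.mean(1.0 - sims[iu]))    # cosine distance = 1 - similarity

# Example: 8 generations for one prompt, 512-dim embeddings (random stand-ins).
rng = np.random.default_rng(0)
score = mean_pairwise_cosine_distance(rng.normal(size=(8, 512)))
print(f"diversity score: {score:.3f}")
```

A higher score indicates that the model's samples for that prompt are more spread out in embedding space; averaging such scores per concept category is one way the category-specific biases mentioned above could be surfaced.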
📝 Abstract
Despite advances in generation quality, current text-to-image (T2I) models often lack diversity, generating homogeneous outputs. This work introduces a framework to address the need for robust diversity evaluation in T2I models. Our framework systematically assesses diversity by evaluating individual concepts and their relevant factors of variation. Key contributions include: (1) a novel human evaluation template for nuanced diversity assessment; (2) a curated prompt set covering diverse concepts with their identified factors of variation (e.g., prompt: "An image of an apple", factor of variation: color); and (3) a methodology for comparing models on human annotations using binomial tests. Furthermore, we rigorously compare various image embeddings for diversity measurement. Notably, our principled approach enables ranking of T2I models by diversity, identifying categories where they particularly struggle. This research offers a robust methodology and insights, paving the way for improvements in T2I model diversity and metric development.
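Contribution (3) can be sketched briefly. Assuming each prompt yields one pairwise human judgment of which model's outputs are more diverse, a two-sided binomial test against the null of no preference (p = 0.5) decides whether one model significantly beats the other; the function name, aggregation, and threshold below are hypothetical, since the abstract does not detail the protocol:

```python
from scipy.stats import binomtest

def compare_models(wins_a: int, n_prompts: int, alpha: float = 0.05) -> str:
    """Binomial test on per-prompt human preferences between models A and B.

    wins_a: number of prompts where annotators judged model A more diverse.
    Null hypothesis: neither model is preferred (success probability 0.5).
    """
    result = binomtest(wins_a, n=n_prompts, p=0.5, alternative="two-sided")
    if result.pvalue >= alpha:
        return f"no significant difference (p = {result.pvalue:.3f})"
    winner = "A" if wins_a > n_prompts / 2 else "B"
    return f"model {winner} is significantly more diverse (p = {result.pvalue:.3f})"

# Example: model A judged more diverse on 68 of 100 prompts.
print(compare_models(68, 100))
```

Running such tests over all model pairs and counting significant wins is one straightforward way to obtain the diversity ranking the abstract describes.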