🤖 AI Summary
This work addresses the lack of evaluation frameworks for zero-shot text-to-image models' ability to visually concretize taxonomic concepts, introducing the first zero-shot image generation benchmark grounded in the WordNet hierarchy. Methodologically, it proposes nine taxonomy-aware automatic evaluation metrics and a novel GPT-4-based pairwise evaluation protocol that complements human feedback, combined with WordNet concept sampling, LLM-generated concept predictions, and multidimensional quantitative analysis. Results reveal that model rankings diverge significantly from those observed on standard text-to-image benchmarks: Playground-v2 and FLUX consistently outperform all other models across subsets and metrics, whereas the retrieval-based approach performs worst. The study provides the first systematic empirical validation of text-to-image models' capacity to comprehend and visually realize structured semantic knowledge, establishing a reproducible evaluation paradigm and an evidence-based foundation for automating the construction of high-quality taxonomic visual resources.
📝 Abstract
This paper explores the feasibility of using text-to-image models in a zero-shot setup to generate images for taxonomy concepts. While text-based methods for taxonomy enrichment are well established, the potential of the visual dimension remains unexplored. To address this gap, we propose a comprehensive benchmark for Taxonomy Image Generation that assesses models' abilities to understand taxonomy concepts and generate relevant, high-quality images. The benchmark includes common-sense and randomly sampled WordNet concepts, alongside LLM-generated predictions. We evaluate 12 models using 9 novel taxonomy-related text-to-image metrics and human feedback. Moreover, we pioneer the use of pairwise evaluation with GPT-4 feedback for image generation. Experimental results show that model rankings differ significantly from those on standard T2I tasks: Playground-v2 and FLUX consistently outperform other models across metrics and subsets, while the retrieval-based approach performs poorly. These findings highlight the potential for automating the curation of structured data resources.