🤖 AI Summary
This work addresses the lack of evaluation frameworks for zero-shot text-to-image models' ability to visually concretize taxonomic concepts, introducing the first zero-shot image generation benchmark grounded in the WordNet hierarchy. Methodologically, it proposes nine taxonomy-aware automatic evaluation metrics and a novel GPT-4-based pairwise evaluation protocol that complements human feedback, combined with WordNet concept sampling, LLM-generated concept predictions, and multidimensional quantitative analysis. Results reveal that model rankings diverge significantly from those observed on standard text-to-image benchmarks: Playground-v2 and FLUX consistently outperform all other models across subsets and metrics, whereas the retrieval-based approach performs worst. The study provides the first systematic empirical validation of text-to-image models' capacity to comprehend and visually realize structured semantic knowledge, establishing a reproducible evaluation paradigm and an evidence-based foundation for automating the construction of high-quality taxonomic visual resources.
📝 Abstract
This paper explores the feasibility of using text-to-image models in a zero-shot setup to generate images for taxonomy concepts. While text-based methods for taxonomy enrichment are well established, the potential of the visual dimension remains unexplored. To address this gap, we propose a comprehensive benchmark for Taxonomy Image Generation that assesses models' abilities to understand taxonomy concepts and generate relevant, high-quality images. The benchmark includes common-sense and randomly sampled WordNet concepts, alongside LLM-generated predictions. We evaluate 12 models using 9 novel taxonomy-related text-to-image metrics and human feedback. Moreover, we pioneer the use of pairwise evaluation with GPT-4 feedback for image generation. Experimental results show that model rankings differ significantly from those on standard T2I tasks: Playground-v2 and FLUX consistently outperform other models across metrics and subsets, while the retrieval-based approach performs poorly. These findings highlight the potential for automating the curation of structured data resources.