KITTEN: A Knowledge-Intensive Evaluation of Image Generation on Visual Entities

📅 2024-10-15

🏛️ arXiv.org

📈 Citations: 4

✨ Influential: 1

career value

186K/year

🤖 AI Summary

Current text-to-image evaluation paradigms emphasize aesthetic quality or coarse semantic alignment, neglecting fine-grained fidelity to real-world visual entities (e.g., landmarks, flora, fauna). Method: We introduce KITTEN—the first knowledge-intensive benchmark for entity-level factual fidelity—comprising a multi-source, knowledge-grounded test set; a dual-track evaluation framework combining CLIPScore and DINO-based entity similarity metrics with expert-designed fine-grained human annotation protocols; and systematic assessment of state-of-the-art diffusion and retrieval-augmented models. Results: We find that mainstream models exhibit significant detail distortion across 76% of entity categories. While retrieval augmentation improves fidelity, it suppresses prompt creativity by 32%, exposing a fundamental trade-off between factual accuracy and generative flexibility. KITTEN establishes a reproducible, knowledge-driven evaluation standard and actionable guidance for advancing factually grounded image generation.

Technology Category

Application Category

📝 Abstract

Recent advancements in text-to-image generation have significantly enhanced the quality of synthesized images. Despite this progress, evaluations predominantly focus on aesthetic appeal or alignment with text prompts. Consequently, there is limited understanding of whether these models can accurately represent a wide variety of realistic visual entities - a task requiring real-world knowledge. To address this gap, we propose a benchmark focused on evaluating Knowledge-InTensive image generaTion on real-world ENtities (i.e., KITTEN). Using KITTEN, we conduct a systematic study on the fidelity of entities in text-to-image generation models, focusing on their ability to generate a wide range of real-world visual entities, such as landmark buildings, aircraft, plants, and animals. We evaluate the latest text-to-image models and retrieval-augmented customization models using both automatic metrics and carefully-designed human evaluations, with an emphasis on the fidelity of entities in the generated images. Our findings reveal that even the most advanced text-to-image models often fail to generate entities with accurate visual details. Although retrieval-augmented models can enhance the fidelity of entity by incorporating reference images during testing, they often over-rely on these references and struggle to produce novel configurations of the entity as requested in creative text prompts.

Problem

Research questions and friction points this paper is trying to address.

Evaluating text-to-image models' accuracy in representing real-world visual entities

Assessing models' ability to generate diverse realistic entities like landmarks and animals

Analyzing retrieval-augmented models' over-reliance on reference images for entity fidelity

Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes KITTEN benchmark for entity image generation

Evaluates text-to-image and retrieval-augmented models

Combines human, automatic, and MLLM evaluation methods

🔎 Similar Papers

No similar papers found.