Evaluating Image Hallucination in Text-to-Image Generation with Question-Answering

📅 2024-09-19
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses image hallucination in text-to-image (TTI) generation—violations of factual consistency between generated images and their textual prompts—by introducing I-HallA, an automated framework for factual-consistency evaluation, together with the benchmark I-HallA v1.0. Methodologically, it proposes a visual question answering (VQA)-based paradigm for quantifying factual consistency, combining GPT-4 Omni multi-agent generation of high-quality question–answer pairs, human verification, and VQA-model inference; a Spearman correlation of ρ = 0.95 confirms strong agreement with human judgments. The benchmark comprises 1.2K image–text pairs across nine categories with 1,000 rigorously curated questions emphasizing compositional reasoning challenges. Evaluation of five state-of-the-art TTI models reveals pervasive factual inconsistencies. Code and dataset are publicly released to advance research on factually controllable image generation.

📝 Abstract
Despite the impressive success of text-to-image (TTI) generation models, existing studies overlook the issue of whether these models accurately convey factual information. In this paper, we focus on the problem of image hallucination, where images created by generation models fail to faithfully depict factual content. To address this, we introduce I-HallA (Image Hallucination evaluation with Question Answering), a novel automated evaluation metric that measures the factuality of generated images through visual question answering (VQA). We also introduce I-HallA v1.0, a curated benchmark dataset for this purpose. As part of this process, we develop a pipeline that generates high-quality question-answer pairs using multiple GPT-4 Omni-based agents, with human judgments to ensure accuracy. Our evaluation protocols measure image hallucination by testing if images from existing TTI models can correctly respond to these questions. The I-HallA v1.0 dataset comprises 1.2K diverse image-text pairs across nine categories with 1,000 rigorously curated questions covering various compositional challenges. We evaluate five TTI models using I-HallA and reveal that these state-of-the-art models often fail to accurately convey factual information. Moreover, we validate the reliability of our metric by demonstrating a strong Spearman correlation ($\rho$=0.95) with human judgments. We believe our benchmark dataset and metric can serve as a foundation for developing factually accurate TTI generation models. Additional resources can be found on our project page: https://sgt-lim.github.io/I-HallA/.
Problem

Research questions and friction points this paper is trying to address.

Assessing factual accuracy in text-to-image models
Introducing I-HallA for image hallucination evaluation
Validating metric reliability through human judgment correlation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated evaluation metric I-HallA
GPT-4 Omni-based question generation
Benchmark dataset I-HallA v1.0