🤖 AI Summary
Existing T2I evaluation benchmarks focus solely on explicit alignment between generated images and text prompts, neglecting consistency with implicit real-world knowledge. This work introduces ABP, the first benchmark explicitly designed to assess implicit world-knowledge alignment, covering six realistic scenarios with over 2,000 diverse prompts. The authors propose ABPScore, a training-free automatic evaluation metric built on multimodal large language models (MLLMs), and Inference-Time Knowledge Injection (ITKI), a zero-shot, prompt-augmentation strategy for knowledge-guided generation. Applied to 200 challenging cases, ITKI improves ABPScore by approximately 43%. A comprehensive evaluation of eight state-of-the-art T2I models reveals a widespread, previously undocumented deficiency in grounding generations in basic world knowledge. The ABP dataset, evaluation protocol, and implementation code are publicly released.
📝 Abstract
Recent text-to-image (T2I) generation models have advanced significantly, enabling the creation of high-fidelity images from textual prompts. However, existing evaluation benchmarks primarily focus on the explicit alignment between generated images and prompts, neglecting alignment with real-world knowledge beyond the prompts. To address this gap, we introduce Align Beyond Prompts (ABP), a comprehensive benchmark designed to measure the alignment of generated images with real-world knowledge that extends beyond explicit user prompts. ABP comprises over 2,000 meticulously crafted prompts, covering real-world knowledge across six distinct scenarios. We further introduce ABPScore, a metric that utilizes existing Multimodal Large Language Models (MLLMs) to assess the alignment between generated images and world knowledge beyond prompts, and which demonstrates strong correlation with human judgments. Through a comprehensive evaluation of 8 popular T2I models using ABP, we find that even state-of-the-art models, such as GPT-4o, face limitations in integrating simple real-world knowledge into generated images. To mitigate this issue, we introduce a training-free strategy within ABP, named Inference-Time Knowledge Injection (ITKI). By applying this strategy to optimize 200 challenging samples, we achieved an improvement of approximately 43% in ABPScore. The dataset and code are available at https://github.com/smile365317/ABP.
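To make the metric idea concrete: ABPScore uses an MLLM as a judge of whether a generated image respects world knowledge not stated in the prompt. The sketch below is a minimal, hypothetical illustration of that pattern (an MLLM judge checking implicit-knowledge criteria and aggregating a score); the function names, checklist format, and aggregation are assumptions for illustration, not the paper's actual ABPScore protocol.

```python
def knowledge_alignment_score(image_path, criteria, mllm_judge):
    """Score an image as the fraction of implicit world-knowledge
    criteria an MLLM judge says the image satisfies.

    `mllm_judge(image_path, criterion)` stands in for a real MLLM call
    returning True/False; its interface here is hypothetical.
    """
    passed = sum(1 for c in criteria if mllm_judge(image_path, c))
    return passed / len(criteria)


# Deterministic stub in place of a real MLLM query, for demonstration only.
def stub_judge(image_path, criterion):
    return criterion != "the candles on the cake are lit"


score = knowledge_alignment_score(
    "generated.png",
    ["a cake is present", "the candles on the cake are lit"],
    stub_judge,
)
# score == 0.5: one of two implicit-knowledge checks passed
```

In practice the judge would be an actual MLLM queried with the image and a natural-language criterion; the stub only shows how per-criterion verdicts fold into a single alignment score.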
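ITKI is described as a training-free, inference-time strategy that augments the user prompt with relevant world knowledge before generation. A minimal sketch of that idea, under the assumption that an MLLM is asked to surface the implicit knowledge and the result is appended to the prompt (function names and prompt wording here are hypothetical, not the paper's exact procedure):

```python
def inject_knowledge(prompt, mllm_query):
    """ITKI-style sketch: elicit implicit world knowledge relevant to the
    prompt from an MLLM, then fold it into the generation prompt.

    `mllm_query` stands in for a real MLLM call; its interface and the
    query template are illustrative assumptions.
    """
    knowledge = mllm_query(
        "List the real-world facts a faithful image of this scene "
        f"must respect: {prompt}"
    )
    return f"{prompt}. Ensure the image respects: {knowledge}"


# Deterministic stub standing in for a real MLLM, for demonstration only.
def stub_mllm(query):
    return "a birthday cake has lit candles before they are blown out"


augmented = inject_knowledge(
    "a child about to blow out a birthday cake", stub_mllm
)
print(augmented)
```

The augmented prompt would then be passed to the T2I model unchanged, which is what makes the strategy zero-shot and training-free.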