🤖 AI Summary
Current text-to-image (T2I) evaluation overemphasizes image fidelity and superficial text–image alignment, neglecting deep semantic understanding grounded in world knowledge—e.g., cultural commonsense, spatiotemporal reasoning, and natural science principles.
Method: We introduce WISE, the first benchmark designed specifically for world knowledge-informed semantic evaluation. It comprises 1,000 carefully crafted prompts across 25 subdomains covering cultural common sense, spatio-temporal reasoning, and natural science. It also introduces WiScore, a quantitative metric for knowledge-image alignment intended to overcome the limitations of the traditional CLIP metric. The same prompts and metric are applied to every model, giving a unified cross-model comparison.
Results: Comprehensive evaluation of 20 state-of-the-art T2I and multimodal models reveals a systemic deficiency in world knowledge modeling. The WISE benchmark, along with code and dataset, is publicly released to advance knowledge-enhanced generative modeling.
📝 Abstract
Text-to-Image (T2I) models are capable of generating high-quality artistic creations and visual content. However, existing research and evaluation standards predominantly focus on image realism and shallow text-image alignment, and lack a comprehensive assessment of complex semantic understanding and world knowledge integration in text-to-image generation. To address this challenge, we propose $\textbf{WISE}$, the first benchmark specifically designed for $\textbf{W}$orld Knowledge-$\textbf{I}$nformed $\textbf{S}$emantic $\textbf{E}$valuation. WISE moves beyond simple word-pixel mapping by challenging models with 1,000 meticulously crafted prompts across 25 subdomains in cultural common sense, spatio-temporal reasoning, and natural science. To overcome the limitations of the traditional CLIP metric, we introduce $\textbf{WiScore}$, a novel quantitative metric for assessing knowledge-image alignment. Through comprehensive testing of 20 models (10 dedicated T2I models and 10 unified multimodal models) on these prompts, we find significant limitations in their ability to integrate and apply world knowledge during image generation, and we highlight critical pathways for enhancing knowledge incorporation and application in next-generation T2I models. Code and data are available at https://github.com/PKU-YuanGroup/WISE.