WorldGenBench: A World-Knowledge-Integrated Benchmark for Reasoning-Driven Text-to-Image Generation

📅 2025-05-02
🤖 AI Summary
Current text-to-image models perform poorly on prompts that require world knowledge and implicit reasoning, which limits their real-world applicability. To address this, the authors introduce the first reasoning-driven benchmark integrating humanities and natural science knowledge, accompanied by the Knowledge Checklist Score, a novel quantitative metric that systematically evaluates cross-domain implicit reasoning capabilities, including counterfactual, causal, and cultural metaphorical reasoning. Methodologically, they propose a semantic-consistency-based knowledge assessment framework coupled with multi-dimensional prompt design, enabling a comprehensive side-by-side evaluation of 21 state-of-the-art models. Empirical results show that closed-source autoregressive models (e.g., GPT-4o) significantly outperform open-source diffusion models in knowledge integration and logical reasoning. The work establishes a reproducible evaluation standard and identifies key improvement pathways toward cognitively enhanced text-to-image systems.

πŸ“ Abstract
Recent advances in text-to-image (T2I) generation have achieved impressive results, yet existing models still struggle with prompts that require rich world knowledge and implicit reasoning, both of which are critical for producing semantically accurate, coherent, and contextually appropriate images in real-world scenarios. To address this gap, we introduce WorldGenBench, a benchmark designed to systematically evaluate T2I models' world knowledge grounding and implicit inferential capabilities, covering both the humanities and nature domains. We propose the Knowledge Checklist Score, a structured metric that measures how well generated images satisfy key semantic expectations. Experiments across 21 state-of-the-art models reveal that while diffusion models lead among open-source methods, proprietary autoregressive models like GPT-4o exhibit significantly stronger reasoning and knowledge integration. Our findings highlight the need for deeper understanding and inference capabilities in next-generation T2I systems. Project Page: https://dwanzhang-ai.github.io/WorldGenBench/
Problem

Research questions and friction points this paper is trying to address.

Evaluating T2I models' world knowledge and reasoning
Measuring semantic accuracy with Knowledge Checklist Score
Comparing performance of diffusion and autoregressive models
Innovation

Methods, ideas, or system contributions that make the work stand out.

WorldGenBench evaluates T2I world knowledge
Knowledge Checklist Score measures semantic accuracy
Proprietary models show stronger reasoning capabilities
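
The summary above does not give the exact formula for the Knowledge Checklist Score. As a rough illustration only, the sketch below assumes one plausible reading: each prompt comes with a checklist of expected semantic elements, a judge marks each item as satisfied or not, and the score is the fraction of satisfied items. The names `ChecklistItem` and `knowledge_checklist_score`, and the example checklist entries, are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ChecklistItem:
    """One expected semantic element for a prompt (hypothetical structure)."""
    description: str
    satisfied: bool  # judgment from a human or model evaluator


def knowledge_checklist_score(items: Sequence[ChecklistItem]) -> float:
    """Return the fraction of checklist items marked as satisfied.

    Assumption: every item carries equal weight. The paper may weight
    or aggregate items differently.
    """
    if not items:
        raise ValueError("checklist must contain at least one item")
    return sum(1 for item in items if item.satisfied) / len(items)


# Hypothetical checklist for a single generated image.
example_checklist = [
    ChecklistItem("period-appropriate clothing is shown", True),
    ChecklistItem("architecture matches the stated region", True),
    ChecklistItem("implied season is reflected in the scene", False),
    ChecklistItem("no anachronistic objects appear", False),
]

example_score = knowledge_checklist_score(example_checklist)
```

Under this assumption, a model-level score would be the average of per-image scores across all benchmark prompts; the actual aggregation used by WorldGenBench should be taken from the paper itself.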
Authors

Daoan Zhang
PhD Student, University of Rochester
Computer Vision, Multimodal Learning, LLM
Che Jiang
Tsinghua University
Ruoshi Xu
Southern University of Science and Technology
Biaoxiang Chen
Southern University of Science and Technology
Zijian Jin
New York University
NLP
Yutian Lu
Datawhale org.
Jianguo Zhang
Southern University of Science and Technology
Liang Yong
Chinese Medicine Guangdong Laboratory
Jiebo Luo
University of Rochester
Shengda Luo
Southern University of Science and Technology
AI for Science, Computer Vision