ImagenWorld: Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-ended Real-World Tasks

📅 2026-03-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing image generation evaluation benchmarks, which are often confined to single tasks or domains and lack interpretability in failure analysis. The authors introduce an open-world benchmark spanning six task categories and six real-world domains, comprising 3.6K condition sets and 20K fine-grained human annotations. They further propose the first explainable evaluation framework featuring object- and patch-level error annotations. Leveraging vision-language model (VLM)-based automatic assessment, multi-dimensional error categorization, and large-scale cross-model evaluation, the study conducts systematic stress tests on 14 state-of-the-art models. Results reveal that editing tasks significantly underperform generation tasks, closed-source models generally outperform open-source ones, targeted training mitigates weaknesses in text-dense scenarios, and VLM-based metrics achieve up to 0.79 correlation with human judgments.
📝 Abstract
Advances in diffusion, autoregressive, and hybrid models have enabled high-quality image synthesis for tasks such as text-to-image, editing, and reference-guided composition. Yet existing benchmarks remain limited: they either focus on isolated tasks, cover only narrow domains, or provide opaque scores without explaining failure modes. We introduce ImagenWorld, a benchmark of 3.6K condition sets spanning six core tasks (generation and editing, with single or multiple references) and six topical domains (artworks, photorealistic images, information graphics, textual graphics, computer graphics, and screenshots). The benchmark is supported by 20K fine-grained human annotations and an explainable evaluation schema that tags localized object-level and segment-level errors, complementing automated VLM-based metrics. Our large-scale evaluation of 14 models yields several insights: (1) models typically struggle more with editing tasks than generation tasks, especially local edits; (2) models excel in artistic and photorealistic settings but struggle with symbolic and text-heavy domains such as screenshots and information graphics; (3) closed-source systems lead overall, while targeted data curation (e.g., Qwen-Image) narrows the gap in text-heavy cases; (4) modern VLM-based metrics achieve Kendall accuracies up to 0.79, approximating human ranking, but fall short of fine-grained, explainable error attribution. ImagenWorld provides both a rigorous benchmark and a diagnostic tool to advance robust image generation.
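The "Kendall accuracy" the abstract reports for VLM-based metrics can be read as the fraction of item pairs that the automatic metric and the human judges rank in the same order. A minimal pure-Python sketch of that pairwise-agreement computation (illustrative only; this is not the paper's evaluation code, and the example scores are made up):

```python
from itertools import combinations

def kendall_accuracy(metric_scores, human_scores):
    """Fraction of item pairs ranked in the same order by both score lists.

    Pairs tied in either list are skipped, so the result reflects only
    pairs where both rankings express a strict preference.
    """
    agree = total = 0
    for i, j in combinations(range(len(metric_scores)), 2):
        dm = metric_scores[i] - metric_scores[j]
        dh = human_scores[i] - human_scores[j]
        if dm == 0 or dh == 0:    # skip ties
            continue
        total += 1
        if (dm > 0) == (dh > 0):  # concordant pair
            agree += 1
    return agree / total if total else 0.0

# Hypothetical VLM scores and human ranks for five generated images
vlm = [0.9, 0.4, 0.7, 0.2, 0.6]
human = [5, 2, 4, 1, 3]
print(kendall_accuracy(vlm, human))  # 1.0: the two rankings fully agree
```

A value of 0.79 on this scale means the metric orders roughly four out of five image pairs the same way humans do.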
Problem

Research questions and friction points this paper is trying to address.

image generation
benchmarking
human evaluation
explainable evaluation
real-world tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

explainable evaluation
human annotation
image generation benchmark
fine-grained error analysis
real-world tasks
👥 Authors
Samin Mahdizadeh Sani, University of Waterloo
Max Ku, University of Waterloo (Generative Models; Computer Vision)
Nima Jamali, University of Waterloo
Matina Mahdizadeh Sani, University of Waterloo
Paria Khoshtab, Independent
Wei-Chieh Sun
Parnian Fazel, University of Tehran
Zhi Rui Tam, NTU / Appier (Natural Language Processing)
Thomas Chong
Edisy Kin Wai Chan
Donald Wai Tong Tsang
Chiao-Wei Hsu
Ting Wai Lam
Ho Yin Sam Ng
Chiafeng Chu
Chak-Wing Mak
Keming Wu, Ph.D. Student, Tsinghua University (Computer Vision; Vision Language Models; Generative AI)
Hiu Tung Wong
Yik Chun Ho
Chi Ruan, University of Waterloo
Zhuofeng Li, Independent
I-Sheng Fang
Shih-Ying Yeh, NTHU (Neural Network; Generative Model)
Ho Kei Cheng, University of Illinois Urbana-Champaign (Computer Vision; Machine Learning)
Ping Nie, University of Waterloo (Natural Language Processing; Information Retrieval; Recommendation Systems; Time Series Forecasting)