🤖 AI Summary
Existing text-to-image benchmarks emphasize either reasoning comprehension or world knowledge and visual concepts; none offers a rigorous, multi-disciplinary, high-stakes generative assessment aligned with standardized examinations.
Method: We introduce GenExam—the first standardized-exam-style benchmark for multi-disciplinary text-to-image generation, covering 10 academic disciplines and 1,000 exam-style questions. It features a novel four-level discipline–capability taxonomy, fine-grained scoring rubrics, reference answer images, and examination-inspired prompt design to jointly evaluate knowledge integration, logical reasoning, and visual generation fidelity.
Contribution/Results: Comprehensive evaluation reveals that state-of-the-art models—including GPT-Image-1 and Gemini-2.5-Flash-Image—achieve strict scores below 15%, with most near zero. This demonstrates GenExam’s strong discriminative power and exceptional difficulty, establishing a new, rigorous standard for assessing generative multimodal intelligence.
📝 Abstract
Exams are a fundamental test of expert-level intelligence and require integrated understanding, reasoning, and generation. Existing exam-style benchmarks mainly focus on understanding and reasoning tasks, while current generation benchmarks emphasize the illustration of world knowledge and visual concepts, leaving rigorous drawing exams unevaluated. We introduce GenExam, the first benchmark for multi-disciplinary text-to-image exams, featuring 1,000 samples across 10 subjects with exam-style prompts organized under a four-level taxonomy. Each problem is equipped with ground-truth images and fine-grained scoring points to enable precise evaluation of semantic correctness and visual plausibility. Experiments show that even state-of-the-art models such as GPT-Image-1 and Gemini-2.5-Flash-Image achieve strict scores below 15%, and most models score almost 0%, underscoring the difficulty of our benchmark. By framing image generation as an exam, GenExam offers a rigorous assessment of models' ability to integrate knowledge, reasoning, and generation, providing insights on the path toward AGI.
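The abstract distinguishes strict scores from fine-grained scoring points. A minimal sketch of one plausible aggregation, assuming (this is an illustration, not the paper's exact protocol) that a sample earns strict credit only when every rubric point is satisfied, while a soft score averages the fraction of satisfied points:

```python
# Hypothetical scoring sketch: each sample is a list of booleans,
# one per fine-grained scoring point (assumed semantics, illustrative only).

def strict_score(samples):
    """Fraction of samples whose scoring points are ALL satisfied."""
    passed = sum(1 for points in samples if all(points))
    return passed / len(samples)

def soft_score(samples):
    """Mean per-sample fraction of satisfied scoring points."""
    return sum(sum(p) / len(p) for p in samples) / len(samples)

# Example: one fully correct sample, one with a missed point.
samples = [[True, True, True], [True, False, True]]
print(strict_score(samples))  # 0.5
print(soft_score(samples))    # ~0.833
```

This kind of all-points gating explains why strict scores can sit near 0% even when models satisfy many individual rubric points.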