🤖 AI Summary
Existing unified multimodal models lack systematic, reasoning-centric evaluation benchmarks, hindering the identification of alignment deficits between understanding and generation, as well as generalization bottlenecks on complex visual tasks. To address this, the authors propose GIR-Bench, a reasoning-driven benchmark for evaluating unified multimodal models. It establishes a fine-grained, interpretable assessment framework across three dimensions: (i) understanding–generation consistency, (ii) reasoning-guided text-to-image generation, and (iii) multi-step reasoning in editing. Departing from large-model scoring paradigms, GIR-Bench employs task-specific evaluation pipelines that integrate logical constraints, implicit knowledge, and multi-step reasoning verification. Extensive experiments on mainstream unified multimodal models and generation-only systems reveal that while unified architectures are more capable on reasoning-driven visual tasks, their understanding and generation capacities remain systematically decoupled. This gap persists even under rigorous reasoning-oriented evaluation, exposing a critical limitation of current multimodal foundation models.
📝 Abstract
Unified multimodal models integrate the reasoning capacity of large language models with both image understanding and generation, showing great promise for advanced multimodal intelligence. However, the community still lacks a rigorous reasoning-centric benchmark to systematically evaluate the alignment between understanding and generation, and their generalization potential in complex visual tasks. To this end, we introduce **GIR-Bench**, a comprehensive benchmark that evaluates unified models across three complementary perspectives. First, we investigate understanding–generation consistency (GIR-Bench-UGC), asking whether models can consistently leverage the same knowledge in both understanding and generation tasks. Second, we investigate whether models can perform reasoning-centric text-to-image generation, which requires applying logical constraints and implicit knowledge to generate faithful visual content (GIR-Bench-T2I). Third, we evaluate whether models can handle multi-step reasoning in image editing (GIR-Bench-Edit). For each subset, we carefully design a task-specific evaluation pipeline, enabling fine-grained and interpretable evaluation while mitigating biases from the prevalent MLLM-as-a-Judge paradigm. Extensive evaluations of various unified models and generation-only systems show that although unified models are more capable on reasoning-driven visual tasks, they still exhibit a persistent gap between understanding and generation. The data and code for GIR-Bench are available at [https://hkust-longgroup.github.io/GIR-Bench](https://hkust-longgroup.github.io/GIR-Bench).