GIR-Bench: Versatile Benchmark for Generating Images with Reasoning

📅 2025-10-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing unified multimodal models lack a systematic, reasoning-centric evaluation benchmark, making it hard to identify alignment deficits between understanding and generation, as well as generalization bottlenecks on complex visual tasks. To address this, the authors propose GIR-Bench, a reasoning-driven benchmark for image generation evaluation. It establishes a fine-grained, interpretable assessment framework across three dimensions: (i) understanding-generation consistency, (ii) reasoning-guided text-to-image generation, and (iii) multi-step reasoning in editing. Departing from the prevalent MLLM-as-a-Judge scoring paradigm, GIR-Bench employs task-specific pipelines that verify logical constraints, implicit knowledge, and multi-step reasoning. Extensive experiments on mainstream unified multimodal models and generation-only systems reveal that, while unified architectures hold an advantage on reasoning-driven visual tasks, their understanding and generation capacities remain systematically decoupled. This gap persists even under rigorous reasoning-oriented evaluation, exposing a critical limitation of current multimodal foundation models.

📝 Abstract
Unified multimodal models integrate the reasoning capacity of large language models with both image understanding and generation, showing great promise for advanced multimodal intelligence. However, the community still lacks a rigorous reasoning-centric benchmark to systematically evaluate the alignment between understanding and generation, and their generalization potential in complex visual tasks. To this end, we introduce GIR-Bench, a comprehensive benchmark that evaluates unified models across three complementary perspectives. Firstly, we investigate understanding-generation consistency (GIR-Bench-UGC), asking whether models can consistently leverage the same knowledge in both understanding and generation tasks. Secondly, we investigate whether models can perform reasoning-centric text-to-image generation that requires applying logical constraints and implicit knowledge to generate faithful visual content (GIR-Bench-T2I). Thirdly, we evaluate whether models can handle multi-step reasoning in editing (GIR-Bench-Edit). For each subset, we carefully design task-specific evaluation pipelines tailored to each task. This enables fine-grained and interpretable evaluation while mitigating biases from the prevalent MLLM-as-a-Judge paradigm. Extensive ablations over various unified models and generation-only systems show that, although unified models are more capable on reasoning-driven visual tasks, they still exhibit a persistent gap between understanding and generation. The data and code for GIR-Bench are available at https://hkust-longgroup.github.io/GIR-Bench.
Problem

Research questions and friction points this paper is trying to address.

Evaluating understanding-generation consistency in multimodal models
Assessing reasoning-centric text-to-image generation with constraints
Testing multi-step reasoning capabilities in image editing tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark evaluates multimodal models' understanding-generation consistency
Tests reasoning-centric text-to-image generation with logical constraints
Assesses multi-step reasoning capabilities in image editing tasks
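The task-specific, non-judge evaluation idea above can be sketched in miniature. The snippet below is purely illustrative and not from the paper's codebase (the names `T2ISample` and `count_constraint_score` are hypothetical): it scores a reasoning-centric T2I sample by checking the object counts implied by a logical constraint in the prompt, the kind of verifiable rule-based check the benchmark favors over MLLM-as-a-Judge scoring.

```python
# Illustrative sketch only: GIR-Bench's actual pipelines are task-specific.
# This toy example shows rule-based (non-judge) scoring for a reasoning-centric
# text-to-image prompt that encodes a logical counting constraint.
from dataclasses import dataclass


@dataclass
class T2ISample:
    prompt: str     # e.g. "a photo with one more cat than dogs"
    expected: dict  # object counts implied by resolving the constraint


def count_constraint_score(detected: dict, expected: dict) -> float:
    """Fraction of expected object counts matched exactly.

    `detected` would normally come from an object detector run on the
    generated image; here it is supplied directly for illustration."""
    if not expected:
        return 0.0
    hits = sum(1 for obj, n in expected.items() if detected.get(obj, 0) == n)
    return hits / len(expected)


sample = T2ISample(prompt="one more cat than dogs (i.e. 2 cats, 1 dog)",
                   expected={"cat": 2, "dog": 1})
print(count_constraint_score({"cat": 2, "dog": 1}, sample.expected))  # 1.0
print(count_constraint_score({"cat": 1, "dog": 1}, sample.expected))  # 0.5
```

Because the score is computed from explicit counts rather than a judge model's opinion, it is both interpretable and reproducible, which is the design motivation the abstract describes.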
Hongxiang Li
The Hong Kong University of Science and Technology

Yaowei Li
Peking University
Computer Vision, Generative Models, 3D Vision, Multi-modal Processing

Bin Lin
Peking University

Yuwei Niu
Chongqing University
Visual Representations, Language Priors

Yuhang Yang
University of Science and Technology of China

Xiaoshuang Huang
Xiaohongshu Inc.

Jiayin Cai
Xiaohongshu Inc.

Xiaolong Jiang
Xiaohongshu Inc.

Yao Hu
Zhejiang University
Machine Learning

Long Chen
The Hong Kong University of Science and Technology