🤖 AI Summary
Existing unified multimodal models lack systematic, reasoning-centric evaluation benchmarks, hindering the identification of alignment deficits between understanding and generation, as well as generalization bottlenecks on complex visual tasks. To address this, the authors propose GIR-Bench, a reasoning-driven benchmark for evaluating unified multimodal models. It establishes a fine-grained, interpretable assessment framework across three dimensions: (i) understanding–generation consistency, (ii) reasoning-guided text-to-image generation, and (iii) multi-step reasoning in editing. Departing from large-model scoring paradigms, GIR-Bench employs task-specific evaluation pipelines that integrate logical constraints, implicit knowledge, and multi-step reasoning verification. Extensive experiments on mainstream unified multimodal models and generation-only systems reveal that while unified architectures are more capable on reasoning-driven visual tasks, their understanding and generation capacities remain systematically decoupled. This gap persists even under rigorous reasoning-oriented evaluation, exposing a critical limitation of current multimodal foundation models.
📝 Abstract
Unified multimodal models integrate the reasoning capacity of large language models with both image understanding and generation, showing great promise for advanced multimodal intelligence. However, the community still lacks a rigorous reasoning-centric benchmark to systematically evaluate the alignment between understanding and generation, and their generalization potential in complex visual tasks. To this end, we introduce **GIR-Bench**, a comprehensive benchmark that evaluates unified models across three complementary perspectives. First, we investigate understanding–generation consistency (GIR-Bench-UGC), asking whether models can consistently leverage the same knowledge in both understanding and generation tasks. Second, we investigate whether models can perform reasoning-centric text-to-image generation, which requires applying logical constraints and implicit knowledge to generate faithful visual content (GIR-Bench-T2I). Third, we evaluate whether models can handle multi-step reasoning in image editing (GIR-Bench-Edit). For each subset, we carefully design a task-specific evaluation pipeline, enabling fine-grained and interpretable evaluation while mitigating biases from the prevalent MLLM-as-a-Judge paradigm. Extensive evaluations of various unified models and generation-only systems show that although unified models are more capable on reasoning-driven visual tasks, they still exhibit a persistent gap between understanding and generation. The data and code for GIR-Bench are available at [https://hkust-longgroup.github.io/GIR-Bench](https://hkust-longgroup.github.io/GIR-Bench).