MMIG-Bench: Towards Comprehensive and Explainable Evaluation of Multi-Modal Image Generation Models

📅 2025-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing T2I and image editing benchmarks suffer from a critical disconnect: T2I benchmarks lack multimodal conditioning, while editing benchmarks neglect compositional semantics and commonsense reasoning—leading to incomplete evaluation of multimodal generative models. To address this, we introduce MMIG-Bench, the first comprehensive multimodal image generation benchmark, covering three core tasks—text-to-image generation, image editing, and concept consistency—with 4,850 multi-granularity prompts and 1,750 multi-perspective reference image sets. We propose a three-level interpretable evaluation framework: low-level visual fidelity, mid-level Aspect Matching Score (AMS) grounded in VQA (strongly correlated with human judgment, ρ > 0.87), and high-level aesthetic and preference assessment. Our framework integrates VQA models, multi-scale quality metrics, 32k crowdsourced human ratings, and semantic alignment analysis. Extensive evaluation of 17 state-of-the-art models reveals critical impacts of architectural choices and training data design. All data and code are publicly released.
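The mid-level Aspect Matching Score described above scores how many prompt aspects a VQA model can confirm in the generated image. The snippet below is a minimal sketch of that idea, not the paper's released code: `vqa_answer` is a hypothetical stand-in for any visual question answering model, and the aspect questions would normally be derived from the benchmark's annotated entities, attributes, and relations rather than written by hand.

```python
# Minimal sketch of a VQA-based aspect-matching score in the spirit of AMS.
# Assumption (not from the paper's code): vqa_answer(image, question) wraps any
# VQA model and returns a short string answer such as "yes" or "no".

from typing import Callable, List


def aspect_matching_score(
    image,
    aspect_questions: List[str],
    vqa_answer: Callable[[object, str], str],
) -> float:
    """Fraction of prompt aspects the VQA model judges to be present in the image."""
    if not aspect_questions:
        return 0.0
    hits = 0
    for question in aspect_questions:
        answer = vqa_answer(image, question).strip().lower()
        if answer.startswith("yes"):
            hits += 1
    return hits / len(aspect_questions)


# Hypothetical aspect questions for the prompt
# "a red cube stacked on top of a blue sphere":
questions = [
    "Is there a red cube in the image?",
    "Is there a blue sphere in the image?",
    "Is the cube stacked on top of the sphere?",
]
# score = aspect_matching_score(generated_image, questions, my_vqa_model)
```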

📝 Abstract
Recent multimodal image generators such as GPT-4o, Gemini 2.0 Flash, and Gemini 2.5 Pro excel at following complex instructions, editing images, and maintaining concept consistency. However, they are still evaluated by disjoint toolkits: text-to-image (T2I) benchmarks that lack multi-modal conditioning, and customized image generation benchmarks that overlook compositional semantics and common knowledge. We propose MMIG-Bench, a comprehensive Multi-Modal Image Generation Benchmark that unifies these tasks by pairing 4,850 richly annotated text prompts with 1,750 multi-view reference images across 380 subjects, spanning humans, animals, objects, and artistic styles. MMIG-Bench is equipped with a three-level evaluation framework: (1) low-level metrics for visual artifacts and identity preservation of objects; (2) a novel Aspect Matching Score (AMS): a VQA-based mid-level metric that delivers fine-grained prompt-image alignment and shows strong correlation with human judgments; and (3) high-level metrics for aesthetics and human preference. Using MMIG-Bench, we benchmark 17 state-of-the-art models, including Gemini 2.5 Pro, FLUX, DreamBooth, and IP-Adapter, and validate our metrics with 32k human ratings, yielding in-depth insights into architecture and data design. We will release the dataset and evaluation code to foster rigorous, unified evaluation and accelerate future innovations in multi-modal image generation.
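The abstract's validation step correlates automatic metric scores with crowdsourced human ratings. The snippet below is an illustrative sketch of that kind of check using a Spearman rank correlation; the per-image scores and ratings are placeholder values, not data from MMIG-Bench.

```python
# Illustrative check of how an automatic metric can be validated against
# human ratings, as the paper does with 32k crowdsourced judgments.
# The arrays below are placeholder data, not values from MMIG-Bench.

from scipy.stats import spearmanr

# One automatic metric score and one mean human rating per generated image.
metric_scores = [0.91, 0.42, 0.77, 0.55, 0.88, 0.30]
human_ratings = [4.5, 2.0, 3.5, 3.0, 4.0, 1.5]

rho, p_value = spearmanr(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3g})")
```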
Problem

Research questions and friction points this paper is trying to address.

Lack of unified evaluation for multi-modal image generation models
Insufficient assessment of compositional semantics and common knowledge
Need for comprehensive metrics that align with human judgments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified multi-modal benchmark with rich annotations
Three-level evaluation framework spanning visual fidelity, VQA-based prompt-image alignment (AMS), and aesthetics/human preference
Metrics validated against 32k crowdsourced human ratings