MMGenBench: Fully Automatically Evaluating LMMs from the Text-to-Image Generation Perspective

📅 2024-11-21
🤖 AI Summary
Problem: Existing LMM evaluation benchmarks are often domain-specific, rely heavily on manual annotation, and use short answer formats, so they fail to assess models' deep visual understanding and generative capabilities.
Method: We propose MMGenBench, the first fully automated, multimodal generative evaluation framework, built on a closed-loop pipeline ("image → textual description → generated image → similarity comparison") that eliminates manual annotation. It introduces an end-to-end self-supervised paradigm covering 13 general image patterns (MMGenBench-Test) and domain-oriented subsets (MMGenBench-Domain), implemented as a multi-stage pipeline integrating CLIP/VLMs, Stable Diffusion, and DINOv2/CLIP-Sim for cross-domain generalization assessment.
Contribution/Results: Evaluated on 50+ state-of-the-art LMMs, the framework reveals that over 60% of models scoring highly on conventional benchmarks fail significantly on generative understanding tasks, exposing critical deficiencies in fundamental vision-language generation capabilities.
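The closed loop above can be sketched in a few lines. This is a minimal sketch, not the paper's implementation: `describe_image`, `generate_image`, and `embed_image` are hypothetical stand-ins for the LMM under test, a text-to-image model (e.g. Stable Diffusion), and an image encoder (e.g. DINOv2 or CLIP), and embeddings are assumed to be compared by cosine similarity.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def mmgenbench_score(image, describe_image, generate_image, embed_image) -> float:
    """One pass of the closed loop:
    image -> textual description -> regenerated image -> similarity."""
    description = describe_image(image)        # LMM produces a detailed caption
    regenerated = generate_image(description)  # text-to-image model redraws it
    original_emb = embed_image(image)          # encode both images
    regen_emb = embed_image(regenerated)
    return cosine_similarity(original_emb, regen_emb)
```

The appeal of this design is that the score needs no ground-truth annotation: a description detailed enough to reconstruct the input image yields a high similarity, so the text-to-image model acts as the judge of description quality.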

📝 Abstract
Large Multimodal Models (LMMs) demonstrate impressive capabilities. However, current benchmarks predominantly focus on image comprehension in specific domains, and these benchmarks are labor-intensive to construct. Moreover, their answers tend to be brief, making it difficult to assess the ability of LMMs to generate detailed descriptions of images. To address these limitations, we propose the MMGenBench-Pipeline, a straightforward and fully automated evaluation pipeline. This involves generating textual descriptions from input images, using these descriptions to create auxiliary images via text-to-image generative models, and then comparing the original and generated images. Furthermore, to ensure the effectiveness of MMGenBench-Pipeline, we design MMGenBench-Test, evaluating LMMs across 13 distinct image patterns, and MMGenBench-Domain, focusing on generative image performance. A thorough evaluation involving over 50 popular LMMs demonstrates the effectiveness and reliability of both the pipeline and benchmark. Our observations indicate that numerous LMMs excelling in existing benchmarks fail to adequately complete the basic tasks related to image understanding and description. This finding highlights the substantial potential for performance improvement in current LMMs and suggests avenues for future model optimization. Concurrently, MMGenBench-Pipeline can efficiently assess the performance of LMMs across diverse domains using only image inputs.
Problem

Research questions and friction points this paper is trying to address.

Automated evaluation of Large Multimodal Models (LMMs) for image description generation.
Addressing limitations in current benchmarks for detailed image understanding.
Assessing LMMs' performance across diverse domains using image inputs.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated evaluation pipeline for LMMs
Text-to-image generation as a check on detailed image descriptions
Cross-domain LMM performance assessment
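The cross-domain assessment can be sketched as averaging the per-image closed-loop score over each domain's image set. This is an illustrative sketch only: `score_image` stands in for one pass of the pipeline, and the domain names in the usage example are hypothetical.

```python
from statistics import mean
from typing import Callable, Dict, List


def evaluate_domains(domain_images: Dict[str, List[str]],
                     score_image: Callable[[str], float]) -> Dict[str, float]:
    """Average the closed-loop similarity score per domain, so an LMM
    can be profiled across domains from image inputs alone."""
    return {domain: mean(score_image(img) for img in images)
            for domain, images in domain_images.items()}
```

Because only images are required as input, extending the benchmark to a new domain amounts to supplying a new image set, with no annotation step.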
Hailang Huang
Beihang University, Alibaba Group
Yong Wang
Alibaba Group
Zixuan Huang
Beihang University, Alibaba Group
Huaqiu Li
Tsinghua University
computer vision, machine learning
Tongwen Huang
Alibaba Group
Xiangxiang Chu
Alibaba Group
Richong Zhang
Professor of Computer Science, Beihang University
Data Mining, Recommender System, Social Computing