🤖 AI Summary
Automated evaluation of multimodal generation tasks has long suffered from low correlation with human judgments. To address this, we introduce MMMG, a comprehensive multimodal benchmark covering four modality combinations (image-only, audio-only, interleaved image-text, and interleaved audio-text) that encompasses 49 tasks (29 newly developed) and 937 instructions to systematically assess core capabilities such as reasoning and controllability. Its evaluation pipelines combine model-based scoring with programmatic validation, enabling fine-grained assessment of cross-modal consistency, structural controllability, and semantic fidelity, and achieve 94.3% average agreement with human evaluation. Benchmarking 24 state-of-the-art models reveals critical gaps: even the best-performing model, GPT Image, reaches only 78.3% accuracy on image generation, while audio and interleaved-modality generation remain markedly underdeveloped, highlighting key directions for future research.
📝 Abstract
Automatically evaluating multimodal generation presents a significant challenge, as automated metrics often struggle to align reliably with human evaluation, especially for complex tasks that involve multiple modalities. To address this, we present MMMG, a comprehensive and human-aligned benchmark for multimodal generation across four modality combinations (image, audio, interleaved text and image, interleaved text and audio), with a focus on tasks that present significant challenges for generation models while still enabling reliable automatic evaluation through a combination of models and programs. MMMG encompasses 49 tasks (including 29 newly developed ones), each with a carefully designed evaluation pipeline, and 937 instructions to systematically assess reasoning, controllability, and other key capabilities of multimodal generation models. Extensive validation demonstrates that MMMG is highly aligned with human evaluation, achieving an average agreement of 94.3%. Benchmarking results on 24 multimodal generation models reveal that even though the state-of-the-art model, GPT Image, achieves 78.3% accuracy for image generation, it falls short on multimodal reasoning and interleaved generation. Furthermore, the results suggest considerable headroom for improvement in audio generation, highlighting an important direction for future research.