🤖 AI Summary
Multimodal Retrieval-Augmented Multimodal Generation (M²RAG), in which foundation models consume multimodal web content and produce multimodal responses, has lacked a systematic formalization, high-quality benchmarks, and reliable evaluation protocols. To address this, we introduce MMRAG-Bench, a comprehensive benchmark for M²RAG. Our methodology includes a rigorous data-curation pipeline; a multimodal evaluation framework that combines text-semantic metrics with vision-text alignment metrics (e.g., CLIPScore and LLM-based judgment); and a training set constructed by filtering high-quality samples with these metrics. Experiments show that fine-tuned 7B–8B models outperform GPT-4o on multiple metrics, that the proposed metrics exhibit high agreement with human judgments, and that fine-grained cross-domain analyses validate the data-curation design. All code, datasets, and model weights will be publicly released.
📝 Abstract
We present a systematic investigation of Multi-modal Retrieval-Augmented Multi-modal Generation (M$^2$RAG), a novel task in which foundation models process multi-modal web content and generate multi-modal responses that offer better information density and readability. Despite its potential impact, M$^2$RAG remains understudied, lacking comprehensive analysis and high-quality data resources. To address this gap, we establish a comprehensive benchmark through a rigorous data curation pipeline, and we evaluate with both text-modal metrics and multi-modal metrics based on foundation models. We further propose several strategies that enable foundation models to handle M$^2$RAG effectively, and we construct a training set by filtering high-quality samples with our designed metrics. Extensive experiments demonstrate the reliability of the proposed metrics, map the performance landscape of models under our strategies, and show that our fine-tuned 7B-8B models outperform the state-of-the-art GPT-4o. Additionally, we perform fine-grained analyses across diverse domains and validate the effectiveness of the designs in our data curation pipeline. All resources, including code, datasets, and model weights, will be publicly released.
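The abstract mentions filtering high-quality training samples with alignment metrics such as CLIPScore but does not include the filtering code. As an illustration only, a minimal sketch of such a filter is below, using the standard CLIPScore definition (w · max(cos(image, text), 0) with w = 2.5) on precomputed embeddings; the function names, sample layout, and threshold are hypothetical, not the paper's actual pipeline.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def clip_score(image_emb, text_emb, w=2.5):
    # CLIPScore: w * max(cosine similarity, 0), so the score is in [0, w].
    return w * max(cosine(image_emb, text_emb), 0.0)

def filter_samples(samples, threshold=1.75):
    # Hypothetical filter: keep samples whose image-text alignment
    # score meets the threshold (here, cosine >= 0.7 with w = 2.5).
    return [s for s in samples if clip_score(s["img"], s["txt"]) >= threshold]
```

In practice the embeddings would come from a CLIP image/text encoder, and the threshold would be tuned against human quality judgments rather than fixed a priori.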