🤖 AI Summary
Multimodal Retrieval-Augmented Multimodal Generation (M²RAG), in which foundation models consume multimodal web content and produce multimodal responses, has lacked a systematic formalization, high-quality benchmarks, and reliable evaluation protocols. To address this, we introduce MMRAG-Bench, a comprehensive benchmark for M²RAG. Our methodology includes a rigorous data-curation pipeline; a multimodal evaluation framework that combines text-semantic metrics with vision-text alignment metrics (e.g., CLIPScore and LLM-based judgment); and a training set constructed by filtering high-quality samples with these metrics. Experiments show that fine-tuned 7B–8B models outperform GPT-4o on multiple metrics, that the proposed metrics exhibit high agreement with human judgments, and that fine-grained cross-domain analyses validate the data-curation design. All code, datasets, and model weights will be publicly released.
📝 Abstract
We present a systematic investigation of Multi-modal Retrieval-Augmented Multi-modal Generation (M$^2$RAG), a novel task in which foundation models process multi-modal web content and generate multi-modal responses that offer better information density and readability. Despite its potential impact, M$^2$RAG remains understudied, lacking comprehensive analysis and high-quality data resources. To address this gap, we establish a comprehensive benchmark through a rigorous data curation pipeline, and we evaluate with both text-modal metrics and multi-modal metrics based on foundation models. We further propose several strategies that enable foundation models to handle M$^2$RAG effectively, and we construct a training set by filtering high-quality samples with our designed metrics. Extensive experiments demonstrate the reliability of the proposed metrics, map the performance landscape of models under our strategies, and show that our fine-tuned 7B-8B models outperform the state-of-the-art GPT-4o. Additionally, we perform fine-grained analyses across diverse domains and validate the effectiveness of the designs in our data curation pipeline. All resources, including code, datasets, and model weights, will be publicly released.
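The abstract mentions filtering high-quality training samples with alignment metrics such as CLIPScore but does not include the filtering code. As an illustration only, a minimal sketch of such a filter is below, using the standard CLIPScore definition (w · max(cos(image, text), 0) with w = 2.5) on precomputed embeddings; the function names, sample layout, and threshold are hypothetical, not the paper's actual pipeline.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def clip_score(image_emb, text_emb, w=2.5):
    # CLIPScore: w * max(cosine similarity, 0), so the score is in [0, w].
    return w * max(cosine(image_emb, text_emb), 0.0)

def filter_samples(samples, threshold=1.75):
    # Hypothetical filter: keep samples whose image-text alignment
    # score meets the threshold (here, cosine >= 0.7 with w = 2.5).
    return [s for s in samples if clip_score(s["img"], s["txt"]) >= threshold]
```

In practice the embeddings would come from a CLIP image/text encoder, and the threshold would be tuned against human quality judgments rather than fixed a priori.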