Multi-modal Retrieval Augmented Multi-modal Generation: Datasets, Evaluation Metrics and Strong Baselines

📅 2024-11-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
The multi-modal retrieval-augmented multi-modal generation (M²RAG) task has long lacked a systematic formalization, high-quality benchmarks, and reliable evaluation protocols, which hinders foundation models' ability to comprehend and respond to web-scale multi-modal content. To address this, we introduce MMRAG-Bench, the first comprehensive benchmark for M²RAG. Our methodology includes a rigorous data-cleaning pipeline, a multi-modal evaluation framework integrating text-semantic and vision-text alignment metrics (e.g., CLIPScore and LLM-based judgment), and cross-modal alignment training with high-quality sample selection. Experimental results demonstrate that fine-tuned 7B-8B parameter models outperform GPT-4o across multiple metrics and that our evaluation metrics exhibit high inter-annotator agreement and reliability; all datasets, code, and model weights are fully open-sourced to enable fine-grained, cross-domain performance analysis.
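
As a concrete illustration of the vision-text alignment side of such an evaluation framework, the sketch below computes CLIPScore with an off-the-shelf CLIP checkpoint. The checkpoint name and the weight w = 2.5 follow the original CLIPScore paper (Hessel et al., 2021) and are assumptions for illustration, not necessarily this paper's exact configuration; the LLM-based judgment component is not shown.

```python
# Minimal CLIPScore sketch: w * max(cos(image_emb, text_emb), 0).
# Checkpoint and w=2.5 follow Hessel et al. (2021); assumed, not the paper's setup.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, text: str, w: float = 2.5) -> float:
    inputs = processor(text=[text], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # L2-normalize the projected embeddings, then take cosine similarity.
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return w * max((img * txt).sum(-1).item(), 0.0)
```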

📝 Abstract
We present a systematic investigation of Multi-modal Retrieval Augmented Multi-modal Generation (M$^2$RAG), a novel task that enables foundation models to process multi-modal web content and generate multi-modal responses, which exhibit better information density and readability. Despite its potential impact, M$^2$RAG remains understudied, lacking comprehensive analysis and high-quality data resources. To address this gap, we establish a comprehensive benchmark through a rigorous data curation pipeline, and employ text-modal metrics and multi-modal metrics based on foundation models for evaluation. We further propose several strategies for foundation models to process M$^2$RAG effectively and construct a training set by filtering high-quality samples using designed metrics. Our extensive experiments demonstrate the reliability of our proposed metrics, map the landscape of model performance across our designed strategies, and show that our fine-tuned 7B-8B models outperform the state-of-the-art GPT-4o model. Additionally, we perform fine-grained analyses across diverse domains and validate the effectiveness of our designs in the data curation pipeline. All resources, including code, datasets, and model weights, will be publicly released.
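
To make the abstract's "filtering high-quality samples using designed metrics" step concrete, here is a minimal sketch of threshold-based training-set filtering. The metric callables, sample schema, and thresholds are hypothetical stand-ins, not the paper's released pipeline.

```python
# Hypothetical metric-based filter for training-set construction: keep a
# generated sample only if its text-quality score and its weakest
# image-text alignment score both clear a threshold.
def filter_training_samples(samples, text_scorer, align_scorer,
                            text_min=0.6, align_min=0.5):
    kept = []
    for s in samples:  # s: {"response": str, "images": list of image paths}
        t = text_scorer(s["response"])
        a = min((align_scorer(img, s["response"]) for img in s["images"]),
                default=1.0)  # text-only samples skip the alignment check
        if t >= text_min and a >= align_min:
            kept.append(s)
    return kept
```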
Problem

Research questions and friction points this paper is trying to address.

Investigates Multi-modal Retrieval Augmented Multi-modal Generation (M$^2$RAG)
Addresses the lack of comprehensive analysis and high-quality data resources
Proposes processing strategies and evaluation metrics for foundation models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal retrieval-augmented multi-modal generation (see the sketch after this list)
Foundation-model-based evaluation metrics
Fine-tuned 7B-8B models outperform GPT-4o
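
The sketch below makes the first bullet concrete: retrieve mixed text/image evidence for a query, then prompt a vision-language model to produce an interleaved text-and-image response. The `retriever` and `vlm` interfaces, the `<image_i>` tagging convention, and the prompt wording are all hypothetical; the paper's released code may differ.

```python
# Hypothetical M^2RAG loop: multi-modal retrieval feeding multi-modal
# generation. All interfaces here are placeholders for illustration.
from dataclasses import dataclass

@dataclass
class Evidence:
    text: str                # retrieved passage
    image_path: str | None   # optional image attached to the passage

def m2rag_answer(query: str, retriever, vlm, k: int = 5) -> str:
    evidence = retriever.search(query, top_k=k)  # list[Evidence]
    # Number each passage and mark which ones carry an image so the model
    # can emit <image_i> tags to place pictures in its response.
    context = "\n\n".join(
        f"[{i}] {e.text}" + (f" <image_{i}>" if e.image_path else "")
        for i, e in enumerate(evidence)
    )
    prompt = ("Answer the question using the evidence below, inserting "
              "<image_i> tags where an image should appear.\n\n"
              f"{context}\n\nQuestion: {query}")
    images = [e.image_path for e in evidence if e.image_path]
    return vlm.generate(prompt, images=images)
```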
👥 Authors

Zi-Ao Ma
School of Computer Science and Technology, Beijing Institute of Technology, China

Tian Lan
School of Computer Science and Technology, Beijing Institute of Technology, China

Rong-Cheng Tu
Nanyang Technological University
Image and Video Retrieval · Cross-modal Retrieval · Deep Learning

Yong Hu
WeChat AI, Tencent Inc., China

Heyan Huang
School of Computer Science and Technology, Beijing Institute of Technology, China

Xian-Ling Mao
Beijing Institute of Technology
Web Data Mining · Information Extraction · QA & Dialogue · Topic Modeling · Learning to Hash