Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation

📅 2025-02-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) suffer from outdated knowledge and hallucination due to their reliance on static training data; retrieval-augmented generation (RAG) mitigates this by incorporating external, dynamic information, while multimodal RAG further integrates heterogeneous modalities (such as text, images, audio, and video), which poses unique challenges in cross-modal alignment and reasoning. To address these, the paper offers a structured, comprehensive analysis of multimodal RAG systems, organizing datasets, benchmarks, evaluation metrics, and methodologies, and presents a taxonomy of innovations spanning retrieval, fusion, augmentation, and generation. The authors also release a standardized collection of resources on GitHub. This work lays both conceptual groundwork and practical guidance for building more capable, reliable, and up-to-date multimodal AI systems grounded in dynamic external knowledge bases.

📝 Abstract
Large Language Models (LLMs) struggle with hallucinations and outdated knowledge due to their reliance on static training data. Retrieval-Augmented Generation (RAG) mitigates these issues by integrating external dynamic information, enhancing factual and updated grounding. Recent advances in multimodal learning have led to the development of Multimodal RAG, incorporating multiple modalities such as text, images, audio, and video to enhance the generated outputs. However, cross-modal alignment and reasoning introduce unique challenges to Multimodal RAG, distinguishing it from traditional unimodal RAG. This survey offers a structured and comprehensive analysis of Multimodal RAG systems, covering datasets, metrics, benchmarks, evaluation, methodologies, and innovations in retrieval, fusion, augmentation, and generation. We precisely review training strategies, robustness enhancements, and loss functions, while also exploring the diverse Multimodal RAG scenarios. Furthermore, we discuss open challenges and future research directions to support advancements in this evolving field. This survey lays the foundation for developing more capable and reliable AI systems that effectively leverage multimodal dynamic external knowledge bases. Resources are available at https://github.com/llm-lab-org/Multimodal-RAG-Survey.
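The retrieve-then-augment loop the abstract describes can be illustrated with a minimal, self-contained sketch. Everything below is a hypothetical stand-in, not the survey's implementation: the knowledge base, the bag-of-words `embed`, and the prompt template are illustrative only, and a real multimodal RAG system would use a cross-modal embedding model (e.g. a CLIP-style encoder) in place of lexical matching.

```python
import math
import re
from collections import Counter

# Toy multimodal knowledge base: captions and transcripts stand in for image
# and audio content. All items and values here are illustrative.
KB = [
    {"modality": "text",  "content": "The Eiffel Tower is 330 metres tall."},
    {"modality": "image", "content": "Photo caption: the Eiffel Tower lit up at night."},
    {"modality": "audio", "content": "Transcript: a guided tour near the Eiffel Tower."},
]

def embed(text):
    # Bag-of-words stand-in for a shared embedding space; a real system
    # would encode each modality with a cross-modal model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, k=2):
    # Cross-modal retrieval: rank every item, whatever its modality,
    # against the query in the shared space.
    q = embed(query)
    ranked = sorted(KB, key=lambda item: cosine(q, embed(item["content"])),
                    reverse=True)
    return ranked[:k]

def augment_prompt(query, retrieved):
    # Augmentation/fusion: inline retrieved evidence, tagged by modality,
    # so a downstream generator can ground its answer in it.
    context = "\n".join(f"[{r['modality']}] {r['content']}" for r in retrieved)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context."

print(augment_prompt("How tall is the Eiffel Tower?",
                     retrieve("How tall is the Eiffel Tower?")))
```

The survey's taxonomy adds further stages on top of this skeleton (re-ranking, modality-aware fusion, iterative augmentation), but the retrieve → augment → generate flow is the common core.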
Problem

Research questions and friction points this paper is trying to address.

Addresses hallucinations and outdated knowledge in LLMs
Explores challenges in cross-modal alignment and reasoning
Surveys advances in Multimodal Retrieval-Augmented Generation systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Structured taxonomy of Multimodal RAG innovations across retrieval, fusion, augmentation, and generation
Review of training strategies, robustness enhancements, and loss functions
Open-source collection of datasets, benchmarks, and evaluation resources
Mohammad Mahdi Abootorabi
Qatar Computing Research Institute, Doha, Qatar
Amirhosein Zobeiri
College of Interdisciplinary Science and Technology, University of Tehran, Tehran, Iran
Mahdi Dehghani
Computer Engineering Department, K.N. Toosi University of Technology, Tehran, Iran
Mohammad Mohammadkhani
Computer Engineering Department, Sharif University of Technology, Tehran, Iran
Bardia Mohammadi
Computer Engineering Department, Sharif University of Technology, Tehran, Iran
Omid Ghahroodi
Research Assistant at Qatar Computing Research Institute, Sharif University of Technology Alumni
Machine Learning, Deep Learning, Natural Language Processing, LLM, VLM
M. Baghshah
Computer Engineering Department, Sharif University of Technology, Tehran, Iran
Ehsaneddin Asgari
Scientist at QCRI, UC Berkeley PhD Alum., Prev@ Helmholtz Center, MIT-CSAIL, MIT-BCS, LMU, EPFL, SUT
Natural Language Processing, Bioinformatics, Deep Learning, Digital Humanities, Machine Learning