🤖 AI Summary
Large language models (LLMs) suffer from hallucination and knowledge staleness because they rely on static training data; retrieval-augmented generation (RAG) mitigates this by incorporating external, dynamic information, and multimodal RAG further integrates heterogeneous modalities—such as text, images, audio, and video—which raises unique challenges in cross-modal alignment and joint reasoning. This survey provides a structured, comprehensive analysis of multimodal RAG systems, organizing datasets, benchmarks, and evaluation metrics alongside methodological advances in retrieval, fusion, augmentation, and generation, and reviewing training strategies, robustness techniques, and application scenarios. A standardized resource repository is released on GitHub. Together, this work lays groundwork for building factual, up-to-date, controllable, and trustworthy multimodal AI systems that leverage dynamic external knowledge.
📝 Abstract
Large Language Models (LLMs) struggle with hallucinations and outdated knowledge due to their reliance on static training data. Retrieval-Augmented Generation (RAG) mitigates these issues by integrating external, dynamic information, improving factual grounding and recency. Recent advances in multimodal learning have led to Multimodal RAG, which incorporates multiple modalities—such as text, images, audio, and video—to enhance generated outputs. However, cross-modal alignment and reasoning introduce unique challenges that distinguish Multimodal RAG from traditional unimodal RAG. This survey offers a structured and comprehensive analysis of Multimodal RAG systems, covering datasets, benchmarks, evaluation metrics, and methodological innovations in retrieval, fusion, augmentation, and generation. We review training strategies, robustness enhancements, and loss functions in detail, and explore diverse Multimodal RAG application scenarios. Furthermore, we discuss open challenges and future research directions to support advancement in this evolving field. This survey lays the foundation for developing more capable and reliable AI systems that effectively leverage multimodal, dynamic external knowledge bases. Resources are available at https://github.com/llm-lab-org/Multimodal-RAG-Survey.
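To make the retrieve–fuse–generate pipeline described above concrete, here is a minimal sketch of a multimodal RAG loop. Everything in it is a hypothetical stand-in, not code from the survey: `embed` is a stub for a real cross-modal encoder (e.g. a CLIP-style model that maps text and images into one comparable vector space), and `generate` is a stub for an LLM call. Only the control flow—cross-modal retrieval, fusion of heterogeneous context, and grounded generation—is the point.

```python
import numpy as np

def embed(item: dict) -> np.ndarray:
    """Stub shared-space encoder (hypothetical): hashes content into a
    fixed unit vector. A real system would use a cross-modal encoder so
    that text, image, and audio items land in one embedding space."""
    rng = np.random.default_rng(abs(hash(item["content"])) % (2**32))
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

def retrieve(query: dict, corpus: list[dict], k: int = 2) -> list[dict]:
    """Cosine-similarity top-k retrieval over a mixed-modality corpus.
    (With the stub embeddings above, ranking is arbitrary; a real
    encoder would make it semantic.)"""
    q = embed(query)
    return sorted(corpus, key=lambda d: float(q @ embed(d)), reverse=True)[:k]

def generate(query: dict, context: list[dict]) -> str:
    """Stub generator: fuses retrieved items into a modality-tagged
    context string; a real system would pass this to an LLM."""
    fused = "; ".join(f"[{d['modality']}] {d['content']}" for d in context)
    return f"Answer to '{query['content']}' grounded in: {fused}"

# Toy mixed-modality knowledge base (illustrative data only).
corpus = [
    {"modality": "text", "content": "The Eiffel Tower is 330 m tall."},
    {"modality": "image", "content": "photo of the Eiffel Tower at night"},
    {"modality": "audio", "content": "podcast clip about Paris landmarks"},
]
query = {"modality": "text", "content": "How tall is the Eiffel Tower?"}
answer = generate(query, retrieve(query, corpus))
```

The design mirrors the survey's decomposition: the retriever operates over a shared embedding space so that any modality can answer a text query, and the generator sees fused, modality-tagged evidence rather than raw training-time knowledge.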