🤖 AI Summary
Large language models (LLMs) suffer from hallucination and knowledge staleness because they rely on static training data; retrieval-augmented generation (RAG) mitigates this by incorporating external, dynamic information, and multimodal RAG further integrates heterogeneous modalities—such as text, images, audio, and video—which raises unique challenges in cross-modal alignment and joint reasoning. This survey provides a structured, comprehensive analysis of multimodal RAG systems, organizing datasets, benchmarks, and evaluation metrics alongside methodological advances in retrieval, fusion, augmentation, and generation, and reviewing training strategies, robustness techniques, and application scenarios. A standardized resource repository is released on GitHub. Together, this work lays groundwork for building factual, up-to-date, controllable, and trustworthy multimodal AI systems that leverage dynamic external knowledge.
📝 Abstract
Large Language Models (LLMs) struggle with hallucinations and outdated knowledge due to their reliance on static training data. Retrieval-Augmented Generation (RAG) mitigates these issues by integrating external, dynamic information, improving factual grounding and recency. Recent advances in multimodal learning have led to Multimodal RAG, which incorporates multiple modalities—such as text, images, audio, and video—to enhance generated outputs. However, cross-modal alignment and reasoning introduce unique challenges that distinguish Multimodal RAG from traditional unimodal RAG. This survey offers a structured and comprehensive analysis of Multimodal RAG systems, covering datasets, benchmarks, evaluation metrics, and methodological innovations in retrieval, fusion, augmentation, and generation. We review training strategies, robustness enhancements, and loss functions in detail, and explore diverse Multimodal RAG application scenarios. Furthermore, we discuss open challenges and future research directions to support advancement in this evolving field. This survey lays the foundation for developing more capable and reliable AI systems that effectively leverage multimodal, dynamic external knowledge bases. Resources are available at https://github.com/llm-lab-org/Multimodal-RAG-Survey.
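To make the retrieve–fuse–generate pipeline described above concrete, here is a minimal sketch of a multimodal RAG loop. Everything in it is a hypothetical stand-in, not code from the survey: `embed` is a stub for a real cross-modal encoder (e.g. a CLIP-style model that maps text and images into one comparable vector space), and `generate` is a stub for an LLM call. Only the control flow—cross-modal retrieval, fusion of heterogeneous context, and grounded generation—is the point.

```python
import numpy as np

def embed(item: dict) -> np.ndarray:
    """Stub shared-space encoder (hypothetical): hashes content into a
    fixed unit vector. A real system would use a cross-modal encoder so
    that text, image, and audio items land in one embedding space."""
    rng = np.random.default_rng(abs(hash(item["content"])) % (2**32))
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

def retrieve(query: dict, corpus: list[dict], k: int = 2) -> list[dict]:
    """Cosine-similarity top-k retrieval over a mixed-modality corpus.
    (With the stub embeddings above, ranking is arbitrary; a real
    encoder would make it semantic.)"""
    q = embed(query)
    return sorted(corpus, key=lambda d: float(q @ embed(d)), reverse=True)[:k]

def generate(query: dict, context: list[dict]) -> str:
    """Stub generator: fuses retrieved items into a modality-tagged
    context string; a real system would pass this to an LLM."""
    fused = "; ".join(f"[{d['modality']}] {d['content']}" for d in context)
    return f"Answer to '{query['content']}' grounded in: {fused}"

# Toy mixed-modality knowledge base (illustrative data only).
corpus = [
    {"modality": "text", "content": "The Eiffel Tower is 330 m tall."},
    {"modality": "image", "content": "photo of the Eiffel Tower at night"},
    {"modality": "audio", "content": "podcast clip about Paris landmarks"},
]
query = {"modality": "text", "content": "How tall is the Eiffel Tower?"}
answer = generate(query, retrieve(query, corpus))
```

The design mirrors the survey's decomposition: the retriever operates over a shared embedding space so that any modality can answer a text query, and the generator sees fused, modality-tagged evidence rather than raw training-time knowledge.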