M$^3$KG-RAG: Multi-hop Multimodal Knowledge Graph-enhanced Retrieval-Augmented Generation

📅 2025-12-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address three key challenges in audio-visual multimodal RAG—narrow modality coverage of knowledge graphs, weak multi-hop connectivity, and imprecise retrieval—this paper proposes a query-aligned Multi-hop Multimodal Knowledge Graph (M³KG) construction and retrieval framework. We introduce a novel lightweight multi-agent construction method that significantly expands the modality granularity and cross-modal path depth of multimodal knowledge graphs (MMKGs). Furthermore, we design the GRASP mechanism—comprising query-driven entity anchoring, supportiveness assessment, and redundant context pruning—to enhance retrieval precision. By integrating modality-aware retrieval, query grounding, relevance scoring, and embedding alignment, our approach improves fact consistency and cross-modal localization accuracy for multimodal large language models (MLLMs) in multi-hop reasoning. Extensive evaluation across multiple multimodal benchmarks demonstrates substantial gains in answer faithfulness and reasoning depth.
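The paper does not provide an implementation of GRASP here, but its three stages (query-driven entity anchoring, supportiveness assessment, redundant context pruning) can be illustrated with a minimal sketch. All names, thresholds, and the use of plain cosine similarity below are assumptions for illustration, not the authors' actual method:

```python
import math

def cosine(a, b):
    # cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def grasp_retrieve(query_vec, candidates,
                   support_threshold=0.5, redundancy_threshold=0.95):
    """candidates: list of (entity_id, context_vec) pairs retrieved from the MMKG.

    Hypothetical sketch of a GRASP-style pipeline:
      1) anchor entities to the query by similarity,
      2) keep only contexts whose score suggests they support an answer,
      3) prune contexts that are near-duplicates of already-kept ones.
    """
    # 1) Entity anchoring: rank candidates by similarity to the query
    anchored = sorted(candidates,
                      key=lambda c: cosine(query_vec, c[1]), reverse=True)
    # 2) Supportiveness assessment: discard off-topic contexts
    supported = [c for c in anchored
                 if cosine(query_vec, c[1]) >= support_threshold]
    # 3) Redundancy pruning: drop contexts too similar to retained ones
    kept = []
    for eid, vec in supported:
        if all(cosine(vec, kv) < redundancy_threshold for _, kv in kept):
            kept.append((eid, vec))
    return [eid for eid, _ in kept]
```

With a toy query vector `[1.0, 0.0]`, a relevant context, its duplicate, and an off-topic context, only the single relevant context survives: the duplicate is pruned in stage 3 and the off-topic one is filtered in stage 2.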

📝 Abstract
Retrieval-Augmented Generation (RAG) has recently been extended to multimodal settings, connecting multimodal large language models (MLLMs) with vast corpora of external knowledge such as multimodal knowledge graphs (MMKGs). Despite their recent success, multimodal RAG in the audio-visual domain remains challenging due to 1) limited modality coverage and multi-hop connectivity of existing MMKGs, and 2) retrieval based solely on similarity in a shared multimodal embedding space, which fails to filter out off-topic or redundant knowledge. To address these limitations, we propose M$^3$KG-RAG, a Multi-hop Multimodal Knowledge Graph-enhanced RAG that retrieves query-aligned audio-visual knowledge from MMKGs, improving reasoning depth and answer faithfulness in MLLMs. Specifically, we devise a lightweight multi-agent pipeline to construct multi-hop MMKG (M$^3$KG), which contains context-enriched triplets of multimodal entities, enabling modality-wise retrieval based on input queries. Furthermore, we introduce GRASP (Grounded Retrieval And Selective Pruning), which ensures precise entity grounding to the query, evaluates answer-supporting relevance, and prunes redundant context to retain only knowledge essential for response generation. Extensive experiments across diverse multimodal benchmarks demonstrate that M$^3$KG-RAG significantly enhances MLLMs' multimodal reasoning and grounding over existing approaches.
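The abstract describes retrieving multi-hop knowledge as context-enriched triplets from M³KG. The graph traversal itself is not specified in the abstract; the following is a minimal sketch, under the assumption that multi-hop retrieval amounts to collecting triplets within a bounded number of hops of the query-anchored entities (function and entity names are hypothetical):

```python
from collections import deque

def multi_hop_context(triples, seed_entities, max_hops=2):
    """triples: list of (head, relation, tail) tuples from the knowledge graph.

    Returns all triples reachable within max_hops of any seed entity,
    a toy stand-in for multi-hop MMKG traversal.
    """
    # index triples by both endpoints so hops can traverse either direction
    adj = {}
    for h, r, t in triples:
        adj.setdefault(h, []).append((h, r, t))
        adj.setdefault(t, []).append((h, r, t))
    seen_entities = set(seed_entities)
    seen_triples = set()
    collected = []
    frontier = deque((e, 0) for e in seed_entities)
    while frontier:
        entity, depth = frontier.popleft()
        if depth >= max_hops:
            continue  # do not expand beyond the hop budget
        for trip in adj.get(entity, []):
            if trip in seen_triples:
                continue
            seen_triples.add(trip)
            collected.append(trip)
            # enqueue the other endpoints for the next hop
            for nxt in (trip[0], trip[2]):
                if nxt not in seen_entities:
                    seen_entities.add(nxt)
                    frontier.append((nxt, depth + 1))
    return collected
```

For example, seeding from a "dog" entity with `max_hops=2` over a chain dog → bark audio → park video → children collects the first two triples but stops before the third, keeping the retrieved context bounded by the hop budget.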
Problem

Research questions and friction points this paper is trying to address.

Existing MMKGs offer narrow modality coverage and weak multi-hop connectivity
Similarity-only retrieval in a shared embedding space admits off-topic or redundant knowledge
MLLMs consequently lack reasoning depth and answer faithfulness in audio-visual multi-hop tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Constructs multi-hop multimodal knowledge graphs with enriched triplets
Uses GRASP for precise entity grounding and relevance evaluation
Prunes redundant context so only answer-essential knowledge is retained