🤖 AI Summary
To address three key challenges in audio-visual multimodal RAG—narrow modality coverage of knowledge graphs, weak multi-hop connectivity, and imprecise retrieval—this paper proposes a query-aligned Multi-hop Multimodal Knowledge Graph (M³KG) construction and retrieval framework. We introduce a novel lightweight multi-agent construction method that significantly expands the modality granularity and cross-modal path depth of multimodal knowledge graphs (MMKGs). Furthermore, we design the GRASP mechanism—comprising query-driven entity anchoring, supportiveness assessment, and redundant context pruning—to enhance retrieval precision. By integrating modality-aware retrieval, query grounding, relevance scoring, and embedding alignment, our approach improves fact consistency and cross-modal localization accuracy for multimodal large language models (MLLMs) in multi-hop reasoning. Extensive evaluation across multiple multimodal benchmarks demonstrates substantial gains in answer faithfulness and reasoning depth.
📝 Abstract
Retrieval-Augmented Generation (RAG) has recently been extended to multimodal settings, connecting multimodal large language models (MLLMs) with vast corpora of external knowledge such as multimodal knowledge graphs (MMKGs). Despite this recent success, multimodal RAG in the audio-visual domain remains challenging due to 1) the limited modality coverage and multi-hop connectivity of existing MMKGs, and 2) retrieval based solely on similarity in a shared multimodal embedding space, which fails to filter out off-topic or redundant knowledge. To address these limitations, we propose M$^3$KG-RAG, a Multi-hop Multimodal Knowledge Graph-enhanced RAG that retrieves query-aligned audio-visual knowledge from MMKGs, improving reasoning depth and answer faithfulness in MLLMs. Specifically, we devise a lightweight multi-agent pipeline to construct a multi-hop MMKG (M$^3$KG), which contains context-enriched triplets of multimodal entities, enabling modality-wise retrieval based on input queries. Furthermore, we introduce GRASP (Grounded Retrieval And Selective Pruning), which ensures precise entity grounding to the query, evaluates answer-supporting relevance, and prunes redundant context to retain only knowledge essential for response generation. Extensive experiments across diverse multimodal benchmarks demonstrate that M$^3$KG-RAG significantly enhances MLLMs' multimodal reasoning and grounding over existing approaches.
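To make the three GRASP stages concrete, here is a minimal sketch of a GRASP-style retrieval pass over knowledge-graph triplets. This is an illustrative assumption, not the paper's implementation: the function names, the toy embeddings, and the thresholds are all hypothetical, and the real system operates on audio-visual entities in a learned shared embedding space rather than hand-set vectors.

```python
# Hypothetical GRASP-style pass: grounding -> supportiveness -> pruning.
# All names and thresholds here are illustrative assumptions.
import math
from dataclasses import dataclass, field


@dataclass
class Triplet:
    head: str
    relation: str
    tail: str
    embedding: list = field(default_factory=list)  # toy stand-in for a multimodal embedding


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def grasp_retrieve(query_emb, query_terms, triplets,
                   support_threshold=0.5, redundancy_threshold=0.95):
    # 1) Entity grounding: keep only triplets whose head or tail entity
    #    actually appears among the query's entity terms.
    grounded = [t for t in triplets if {t.head, t.tail} & set(query_terms)]

    # 2) Supportiveness assessment: score each grounded triplet against the
    #    query embedding and drop triplets below the support threshold.
    scored = [(cosine(query_emb, t.embedding), t) for t in grounded]
    supported = [(s, t) for s, t in scored if s >= support_threshold]

    # 3) Redundancy pruning: greedily keep the highest-scoring triplets,
    #    skipping any whose embedding nearly duplicates one already kept.
    supported.sort(key=lambda st: st[0], reverse=True)
    kept = []
    for _, t in supported:
        if all(cosine(t.embedding, k.embedding) < redundancy_threshold
               for k in kept):
            kept.append(t)
    return kept
```

For example, with two near-duplicate "dog barks" triplets and one off-topic "cat meows" triplet, a query about a barking dog grounds away the cat triplet, both dog triplets pass the supportiveness check, and pruning then collapses the near-duplicates to a single retained triplet.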