MG²-RAG: Multi-Granularity Graph for Multimodal Retrieval-Augmented Generation

📅 2026-04-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing retrieval-augmented generation (RAG) systems in supporting complex cross-modal reasoning within multimodal large language models, where conventional graph construction relies on costly text translation and often discards fine-grained visual information. To overcome these challenges, the authors propose a lightweight, multi-granularity graph RAG framework that unifies textual entities and visual regions into cohesive multimodal nodes. The approach leverages lightweight text parsing and entity-driven visual grounding to construct a hierarchical multimodal knowledge graph, complemented by a multi-granularity graph retrieval mechanism enabling structured multi-hop reasoning. Evaluated across four multimodal benchmarks, the method achieves state-of-the-art performance while accelerating graph construction by 43.3× and reducing associated costs by 23.9×.
📝 Abstract
Retrieval-Augmented Generation (RAG) mitigates hallucinations in Multimodal Large Language Models (MLLMs), yet existing systems struggle with complex cross-modal reasoning. Flat vector retrieval often ignores structural dependencies, while current graph-based methods rely on costly "translation-to-text" pipelines that discard fine-grained visual information. To address these limitations, we propose MG²-RAG, a lightweight Multi-Granularity Graph RAG framework that jointly improves graph construction, modality fusion, and cross-modal retrieval. MG²-RAG constructs a hierarchical multimodal knowledge graph by combining lightweight textual parsing with entity-driven visual grounding, enabling textual entities and visual regions to be fused into unified multimodal nodes that preserve atomic evidence. Building on this representation, we introduce a multi-granularity graph retrieval mechanism that aggregates dense similarities and propagates relevance across the graph to support structured multi-hop reasoning. Extensive experiments across four representative multimodal tasks (retrieval, knowledge-based VQA, reasoning, and classification) show that MG²-RAG consistently achieves state-of-the-art performance while cutting graph construction overhead, with an average 43.3× speedup and 23.9× cost reduction compared with advanced graph-based frameworks.
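The abstract's retrieval mechanism "aggregates dense similarities and propagates relevance across the graph." The paper's actual node construction and propagation rule are not specified here, but the general idea can be illustrated with a minimal personalized-PageRank-style sketch: seed each node with a dense-similarity score, then spread relevance along graph edges so that multi-hop neighbors of strong seeds become retrievable. All names, the toy graph, and the restart parameter below are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: generic relevance propagation over a graph,
# NOT MG²-RAG's actual retrieval algorithm. The graph and scores are toy data.
import numpy as np

def propagate_relevance(adj, seed_scores, restart=0.15, iters=20):
    """Spread seed relevance over a graph, personalized-PageRank style.

    adj         : (n, n) adjacency matrix (row-normalized internally)
    seed_scores : (n,) initial dense-similarity scores per node
    restart     : probability of jumping back to the seed distribution
    """
    # Row-normalize so each node distributes its relevance to its neighbors.
    row_sums = adj.sum(axis=1, keepdims=True)
    P = np.divide(adj, row_sums,
                  out=np.zeros_like(adj, dtype=float), where=row_sums > 0)
    s = seed_scores / seed_scores.sum()
    r = s.copy()
    for _ in range(iters):
        # Mix restart mass toward the seeds with mass flowing along edges.
        r = restart * s + (1 - restart) * (P.T @ r)
    return r

# Toy chain 0-1-2: node 2 has zero seed score but gains relevance
# through a two-hop path, enabling multi-hop evidence retrieval.
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)
seeds = np.array([1.0, 0.0, 0.0])
scores = propagate_relevance(adj, seeds)
```

The key property for multi-hop reasoning is that `scores[2]` becomes nonzero purely through graph structure; in a real system the ranked nodes (here, fused multimodal nodes) would then be passed to the generator as evidence.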
Problem

Research questions and friction points this paper is trying to address.

Retrieval-Augmented Generation
Multimodal Reasoning
Cross-Modal Retrieval
Knowledge Graph
Visual Grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Granularity Graph
Multimodal Retrieval-Augmented Generation
Visual Grounding
Cross-Modal Reasoning
Knowledge Graph
👥 Authors
Sijun Dai, School of Intelligence Science and Engineering, Harbin Institute of Technology (Shenzhen)
Qiang Huang, Harbin Institute of Technology (Shenzhen)
Xiaoxing You, School of Computer Science, Hangzhou Dianzi University
Jun Yu, Shenzhen University