OMGM: Orchestrate Multiple Granularities and Modalities for Efficient Multimodal Retrieval

📅 2025-05-10
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This paper addresses the inefficiency of cross-modal, multi-granularity external-knowledge retrieval in knowledge-based visual question answering (KB-VQA). The authors propose a coarse-to-fine, three-stage collaborative retrieval framework: (1) cross-modal coarse-grained initial retrieval, (2) multimodal fusion-based reranking, and (3) fine-grained refinement via a hierarchical-attention text reranker aligned with knowledge-granularity embeddings. They introduce the first coordinated orchestration mechanism spanning multiple granularities and modalities, integrating cross-modal contrastive learning with joint modeling of heterogeneous modalities and granularities. The method achieves state-of-the-art retrieval performance on InfoSeek and Encyclopedic-VQA and yields significant improvements in KB-VQA answer accuracy, validating the effectiveness, efficiency, and scalability of the proposed multimodal RAG paradigm.

📝 Abstract
Vision-language retrieval-augmented generation (RAG) has become an effective approach for tackling Knowledge-Based Visual Question Answering (KB-VQA), which requires external knowledge beyond the visual content presented in images. The effectiveness of Vision-language RAG systems hinges on multimodal retrieval, which is inherently challenging due to the diverse modalities and knowledge granularities in both queries and knowledge bases. Existing methods have not fully tapped into the potential interplay between these elements. We propose a multimodal RAG system featuring a coarse-to-fine, multi-step retrieval that harmonizes multiple granularities and modalities to enhance efficacy. Our system begins with a broad initial search aligning knowledge granularity for cross-modal retrieval, followed by a multimodal fusion reranking to capture the nuanced multimodal information for top entity selection. A text reranker then filters out the most relevant fine-grained section for augmented generation. Extensive experiments on the InfoSeek and Encyclopedic-VQA benchmarks show our method achieves state-of-the-art retrieval performance and highly competitive answering results, underscoring its effectiveness in advancing KB-VQA systems.
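
The abstract describes a three-stage, coarse-to-fine retrieval pipeline: a broad cross-modal search, a multimodal fusion reranking over the top candidates, and a text reranker that selects the final fine-grained section. Below is a minimal sketch of how such a pipeline could be wired together; every name here (coarse_retrieve, fusion_rerank, section_rerank, the kb structure, the equal-weight fusion) is an illustrative assumption, not the paper's actual implementation.

```python
# Minimal sketch of a coarse-to-fine multimodal retrieval pipeline.
# All data structures and scoring choices are illustrative assumptions,
# not the paper's actual components.
import numpy as np

def coarse_retrieve(query_img_emb, kb_embs, top_k=100):
    """Stage 1: broad cross-modal retrieval over entity embeddings.

    Assumes all embeddings are L2-normalized, so a dot product equals
    cosine similarity.
    """
    sims = kb_embs @ query_img_emb          # (num_entities,)
    return np.argsort(-sims)[:top_k]        # indices of the best candidates

def fusion_rerank(candidate_ids, query_img_emb, query_txt_emb, kb, top_k=10):
    """Stage 2: rerank candidates with a fused image + text score."""
    scored = []
    for idx in candidate_ids:
        img_sim = float(query_img_emb @ kb[idx]["img_emb"])
        txt_sim = float(query_txt_emb @ kb[idx]["txt_emb"])
        scored.append((0.5 * img_sim + 0.5 * txt_sim, idx))  # equal-weight fusion
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:top_k]]

def section_rerank(entity_ids, question, kb, score_fn, top_k=1):
    """Stage 3: pick the most relevant fine-grained section(s) of the
    surviving entities, e.g. with a cross-encoder passed in as score_fn."""
    sections = [(eid, sec) for eid in entity_ids for sec in kb[eid]["sections"]]
    scores = [score_fn(question, sec) for _, sec in sections]
    order = np.argsort(scores)[::-1][:top_k]
    return [sections[i] for i in order]
```

In the paper, the fusion and reranking stages are learned components; the fixed 0.5/0.5 weighting above only illustrates the staged narrowing from a large knowledge base to a handful of entities and finally to one section.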
Problem

Research questions and friction points this paper is trying to address.

Enhancing multimodal retrieval for KB-VQA by harmonizing granularities and modalities
Addressing diverse modalities and knowledge granularities in retrieval-augmented generation
Improving retrieval efficacy via coarse-to-fine multi-step alignment and fusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Coarse-to-fine multi-step retrieval for multimodal RAG
Multimodal fusion reranking for nuanced information capture
Text reranker that filters the most relevant fine-grained sections (see the end-to-end sketch after this list)
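
Chained end to end, these three components progressively narrow the knowledge base. A hypothetical usage, reusing the placeholder functions from the sketch under the abstract (encode_image and encode_text are likewise assumed stand-ins for a multimodal encoder pair, not functions from the paper):

```python
# Hypothetical end-to-end chaining of the three placeholder stages above.
image_emb = encode_image(query_image)      # assumed image encoder
question_emb = encode_text(question)       # assumed text encoder

candidate_ids = coarse_retrieve(image_emb, kb_embs, top_k=100)          # stage 1
entity_ids = fusion_rerank(candidate_ids, image_emb, question_emb, kb)  # stage 2
top_sections = section_rerank(entity_ids, question, kb, score_fn)       # stage 3
```

The selected section(s) would then be appended to the question as context for the answer generator.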
👥 Authors
Wei Yang (Microsoft Research Asia)
Jingjing Fu (Microsoft; image/video processing)
Rui Wang (Microsoft Research Asia)
Jinyu Wang (Microsoft Research Asia)
Lei Song (Microsoft Research Asia)
Jiang Bian (Microsoft Research Asia)