Multimodal Iterative RAG for Knowledge Visual Question Answering

📅 2025-08-31
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address insufficient external knowledge acquisition and the limited effectiveness of single-step retrieval-augmented generation (RAG) in knowledge-intensive visual question answering (VQA), this paper proposes MI-RAG, a multimodal iterative RAG framework. Methodologically, MI-RAG introduces: (1) cross-modal joint retrieval coupled with dynamic reasoning updates; (2) iterative multi-query generation guided by cumulative reasoning records; and (3) multi-round cross-modal knowledge fusion for progressive understanding. It enables iterative reasoning over heterogeneous image-text knowledge bases, substantially enhancing complex compositional reasoning capabilities. On benchmarks including Encyclopedic VQA, InfoSeek, and OK-VQA, MI-RAG simultaneously improves both retrieval recall and answer accuracy. Empirical results demonstrate the method's effectiveness, robustness across diverse knowledge sources, and scalability to increasingly complex reasoning tasks.
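The iterative loop summarized above (multi-query generation from an accumulated reasoning record, joint search over heterogeneous knowledge bases, and synthesis of new knowledge back into the record) can be sketched in Python. Everything below is a hypothetical stand-in, not the authors' implementation: the toy knowledge bases, the substring-matching "retrieval", and the helper names `generate_queries`, `joint_search`, and `mi_rag` are all illustrative assumptions.

```python
from dataclasses import dataclass, field

# Toy heterogeneous knowledge bases: one visually grounded, one textual.
# (Hypothetical stand-ins for the paper's image-text knowledge bases.)
VISUAL_KB = {"eiffel tower": "iron lattice tower in Paris"}
TEXT_KB = {"paris": "capital of France", "gustave eiffel": "engineer of the tower"}

@dataclass
class ReasoningRecord:
    """Accumulated reasoning across iterations (simplified to a list of facts)."""
    facts: list = field(default_factory=list)

def generate_queries(question: str, record: ReasoningRecord) -> list:
    """Formulate a multi-query from the question plus knowledge gathered so far."""
    queries = [question.lower()]
    # Toy heuristic: each previously retrieved fact spawns follow-up queries.
    for fact in record.facts:
        queries.extend(fact.lower().split())
    return queries

def joint_search(queries: list, kbs: list) -> list:
    """Search every knowledge base with every query; return matched facts."""
    hits = []
    for kb in kbs:
        for key, value in kb.items():
            if any(key in q or q in key for q in queries):
                hits.append(value)
    return hits

def mi_rag(question: str, iterations: int = 3) -> ReasoningRecord:
    """Iteratively retrieve and fold new knowledge into the reasoning record."""
    record = ReasoningRecord()
    for _ in range(iterations):
        queries = generate_queries(question, record)
        new_facts = joint_search(queries, [VISUAL_KB, TEXT_KB])
        # Synthesize newly acquired knowledge into the reasoning record.
        for fact in new_facts:
            if fact not in record.facts:
                record.facts.append(fact)
    return record

record = mi_rag("Who engineered the Eiffel Tower?")
```

Note how the second iteration retrieves the "capital of France" entry only because the first iteration's fact ("... in Paris") was folded into the next round's queries; this is the compounding-retrieval behavior the paper attributes to iterating over a reasoning record.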

πŸ“ Abstract
While Multimodal Large Language Models (MLLMs) have significantly advanced multimodal understanding, their performance remains limited on knowledge-intensive visual questions that require external knowledge beyond the image. Retrieval-Augmented Generation (RAG) has become a promising solution for providing models with external knowledge, its conventional single-pass framework often fails to gather sufficient knowledge. To overcome this limitation, we propose MI-RAG, a Multimodal Iterative RAG framework that leverages reasoning to enhance retrieval and update reasoning over newly retrieved knowledge across modalities. At each iteration, MI-RAG leverages an accumulated reasoning record to dynamically formulate a multi-query. These queries then drive a joint search across heterogeneous knowledge bases containing both visually-grounded and textual knowledge. The newly acquired knowledge is synthesized into the reasoning record, progressively refining understanding across iterations. Experiments on challenging benchmarks, including Encyclopedic VQA, InfoSeek, and OK-VQA, show that MI-RAG significantly improves both retrieval recall and answer accuracy, establishing a scalable approach for compositional reasoning in knowledge-intensive VQA.
Problem

Research questions and friction points this paper is trying to address.

Addresses knowledge-intensive visual question answering limitations
Overcomes single-pass retrieval insufficiency in multimodal tasks
Enhances cross-modal reasoning through iterative knowledge refinement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Iterative RAG framework
Dynamic multi-query formulation
Joint search across heterogeneous knowledge bases
Changin Choi
Interdisciplinary Program in Artificial Intelligence
Wonseok Lee
Interdisciplinary Program in Artificial Intelligence
Jungmin Ko
Interdisciplinary Program in Artificial Intelligence
Wonjong Rhee
Seoul National University
Deep Learning Theory · Artificial Intelligence · Information Theory