🤖 AI Summary
To address insufficient external knowledge acquisition and the limited effectiveness of single-step retrieval-augmented generation (RAG) in knowledge-intensive visual question answering (VQA), this paper proposes MI-RAG, a multimodal iterative RAG framework. Methodologically, MI-RAG introduces: (1) cross-modal joint retrieval coupled with dynamic reasoning updates; (2) iterative multi-query generation guided by cumulative reasoning records; and (3) multi-round cross-modal knowledge fusion for progressive understanding. It enables iterative reasoning over heterogeneous image-text knowledge bases, substantially enhancing complex compositional reasoning capabilities. On benchmarks including Encyclopedic VQA, InfoSeek, and OK-VQA, MI-RAG simultaneously improves both retrieval recall and answer accuracy. Empirical results demonstrate the method's effectiveness, robustness across diverse knowledge sources, and scalability to increasingly complex reasoning tasks.
📝 Abstract
While Multimodal Large Language Models (MLLMs) have significantly advanced multimodal understanding, their performance remains limited on knowledge-intensive visual questions that require external knowledge beyond the image. Retrieval-Augmented Generation (RAG) has become a promising solution for providing models with external knowledge, but its conventional single-pass framework often fails to gather sufficient knowledge. To overcome this limitation, we propose MI-RAG, a Multimodal Iterative RAG framework that leverages reasoning to enhance retrieval and updates its reasoning over newly retrieved knowledge across modalities. At each iteration, MI-RAG leverages an accumulated reasoning record to dynamically formulate a multi-query. These queries then drive a joint search across heterogeneous knowledge bases containing both visually grounded and textual knowledge. The newly acquired knowledge is synthesized into the reasoning record, progressively refining understanding across iterations. Experiments on challenging benchmarks, including Encyclopedic VQA, InfoSeek, and OK-VQA, show that MI-RAG significantly improves both retrieval recall and answer accuracy, establishing a scalable approach for compositional reasoning in knowledge-intensive VQA.
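The iterative loop described above, where each round's retrieved knowledge feeds the next round's queries, can be sketched as follows. This is a toy illustration, not the paper's implementation: the query generator, retriever, and synthesis step here are simple keyword-matching stand-ins, and all function names are hypothetical.

```python
# Toy sketch of an iterative multimodal RAG loop in the spirit of MI-RAG.
# Real systems would use an MLLM for query generation/synthesis and dense
# retrievers over image and text knowledge bases; here plain strings and
# keyword overlap stand in for both.

def generate_queries(question, record):
    """Formulate a multi-query: the original question plus a follow-up
    derived from the accumulated reasoning record (here, the last fact)."""
    queries = [question]
    if record:
        queries.append(record[-1])
    return queries

def search(knowledge_base, query):
    """Stand-in retriever: return entries sharing any word with the query."""
    words = set(query.lower().split())
    return [doc for doc in knowledge_base
            if words & set(doc.lower().split())]

def mi_rag_loop(question, knowledge_base, max_iters=2):
    record = []  # reasoning record accumulated across iterations
    for _ in range(max_iters):
        for query in generate_queries(question, record):
            for doc in search(knowledge_base, query):
                if doc not in record:
                    record.append(doc)  # synthesize new knowledge into record
    return record

kb = [
    "the tower was designed by gustave eiffel",
    "gustave eiffel was born in dijon",
]
facts = mi_rag_loop("who designed the tower", kb)
```

Note how the second fact is unreachable from the question alone (no word overlap) and is only retrieved in the second iteration, via a query formed from the first round's result; this is the behavior single-pass RAG misses.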