Iterative Multimodal Retrieval-Augmented Generation for Medical Question Answering

📅 2026-04-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of conventional medical retrieval-augmented generation (RAG) systems, which rely solely on text passages and overlook visual information in scientific literature such as figures, tables, and structured layouts. The authors propose MED-VRAG, a framework that uses PMC document page images, rather than OCR-extracted text, as direct input for iterative multimodal retrieval and reasoning in medical question answering. The approach combines patch-level page embeddings from ColQwen2.5, approximate nearest neighbor indexing over per-page centroids, a sharded MapReduce-style LLM filter, a vision-language model, and a memory bank that accumulates evidence across reasoning rounds. On four medical QA benchmarks (MedQA, MedMCQA, PubMedQA, MMLU-Med), MED-VRAG reaches 78.6% average accuracy: 5.8 points above the no-retrieval baseline with the same backbone, and 1.8 points above the 76.8% reported for MedRAG + GPT-4, a cross-paper comparison rather than a head-to-head one. Ablations attribute +1.0 point to page-image retrieval over text chunks, +1.5 to iteration, and +1.0 to the memory bank.
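The multi-round evidence accumulation described in this summary can be sketched as a short loop. This is a minimal illustration, not the authors' implementation: `MemoryBank`, `iterative_answer`, and the toy `retrieve` and `reason` callables are hypothetical names, and stopping early once an answer is produced is an assumption the summary does not confirm. Only the cap of three rounds comes from the abstract.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MemoryBank:
    """Evidence notes carried across reasoning rounds (hypothetical structure)."""
    notes: List[str] = field(default_factory=list)

    def add(self, note: str) -> None:
        # Skip empty or duplicate evidence so the context stays compact.
        if note and note not in self.notes:
            self.notes.append(note)


def iterative_answer(question, retrieve, reason, max_rounds=3):
    """Retrieve, reason, store evidence, and refine the query for up to max_rounds.

    retrieve(query) -> list of page ids                      (assumed interface)
    reason(question, pages, notes) -> (note, answer, query)  (assumed interface)
    """
    memory, query, answer = MemoryBank(), question, None
    for _ in range(max_rounds):
        pages = retrieve(query)
        note, answer, query = reason(question, pages, memory.notes)
        memory.add(note)
        if answer is not None:  # assumption: stop as soon as an answer is given
            break
    return answer, memory.notes  # answer stays None if no round produced one


# Toy stand-ins for the page retriever and the vision-language model.
def toy_retrieve(query):
    return ["page-A"] if "dose" in query else ["page-B"]


def toy_reason(question, pages, notes):
    if "page-A" in pages:
        return "page-A: table lists the dose", "B", question
    return "page-B: background only", None, question + " dose"


answer, notes = iterative_answer("Which option is correct?", toy_retrieve, toy_reason)
print(answer, notes)
```

In the toy run, the first round retrieves an unhelpful page, the refined query retrieves the useful one in round two, and both notes remain in the memory bank when the answer is returned.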

📝 Abstract

Medical retrieval-augmented generation (RAG) systems typically operate on text chunks extracted from biomedical literature, discarding the rich visual content (tables, figures, structured layouts) of original document pages. We propose MED-VRAG, an iterative multimodal RAG framework that retrieves and reasons over PMC document page images instead of OCR'd text. The system pairs ColQwen2.5 patch-level page embeddings with a sharded MapReduce LLM filter, scaling to ~350K pages while keeping Stage-1 retrieval under 30 ms via an offline coarse-to-fine index (C=8 centroids per page, ANN over centroids, exact two-way scoring on the top-R shortlist). A vision-language model (VLM) then iteratively refines its query and accumulates evidence in a memory bank across up to 3 reasoning rounds, with a single iteration costing ~15.9 s and the full three-round pipeline ~47.8 s on 4×A100. Across four medical QA benchmarks (MedQA, MedMCQA, PubMedQA, MMLU-Med), MED-VRAG reaches 78.6% average accuracy. Under controlled comparison with the same Qwen2.5-VL-32B backbone, retrieval contributes a +5.8 point gain over the no-retrieval baseline; we also note a +1.8 point edge over MedRAG + GPT-4 (76.8%), with the caveat that this is a cross-paper rather than head-to-head comparison. Ablations isolate +1.0 from page-image vs text-chunk retrieval, +1.5 from iteration, and +1.0 from the memory bank.
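The coarse-to-fine index in the abstract (a few centroids per page, a cheap search over centroids, then exact scoring on a shortlist) can be illustrated with a small pure-Python sketch. Several details are assumptions, since the abstract does not specify them: centroids here are means of contiguous patch groups, a brute-force scan stands in for the ANN index, and the exact stage uses one-directional late-interaction (MaxSim) scoring rather than the paper's "two-way" scoring. All function names are hypothetical.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def page_centroids(patches, c):
    """Average contiguous groups of patch vectors into at most c centroids (assumed scheme)."""
    size = max(1, -(-len(patches) // c))  # ceiling division
    groups = [patches[i:i + size] for i in range(0, len(patches), size)]
    return [[sum(col) / len(g) for col in zip(*g)] for g in groups]


def maxsim(query, vectors):
    """Late-interaction score: each query vector contributes its best-matching dot product."""
    return sum(max(dot(q, v) for v in vectors) for q in query)


def coarse_to_fine(query, pages, centroids, top_r, top_k):
    # Stage 1 (coarse): score every page on its centroids only.
    # A real system would use an ANN index here; brute force is a stand-in.
    coarse = {pid: maxsim(query, cents) for pid, cents in centroids.items()}
    shortlist = sorted(coarse, key=coarse.get, reverse=True)[:top_r]
    # Stage 2 (fine): exact scoring over all patches, shortlisted pages only.
    fine = {pid: maxsim(query, pages[pid]) for pid in shortlist}
    return sorted(fine, key=fine.get, reverse=True)[:top_k]


# Toy corpus: 50 pages, 32 patch vectors each, C = 8 centroids per page.
rng = random.Random(0)
DIM, PATCHES, C = 16, 32, 8
pages = {
    pid: [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(PATCHES)]
    for pid in range(50)
}
centroids = {pid: page_centroids(p, C) for pid, p in pages.items()}  # built offline

query = pages[7][:4]  # toy query made from four patches of page 7
result = coarse_to_fine(query, pages, centroids, top_r=10, top_k=3)
print(result)
```

The point of the design is that the coarse stage touches only C vectors per page, so the expensive exact scoring runs on R pages instead of the whole corpus. A small `top_r` trades recall for speed, since a relevant page dropped from the shortlist cannot be recovered in the fine stage.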

Problem

Research questions and friction points this paper is trying to address.

Medical Retrieval-Augmented Generation

Multimodal Retrieval

Visual Content

Document Page Images

Medical Question Answering

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal RAG

Page-Image Retrieval

Iterative Reasoning