🤖 AI Summary
This work addresses a limitation of traditional medical retrieval-augmented generation (RAG) systems, which rely solely on textual passages and overlook critical visual information such as figures, tables, and structured layouts in scientific literature. To bridge this gap, the authors propose MED-VRAG, a framework that uses PMC document page images (rather than OCR-extracted text) as direct input for multimodal iterative retrieval and reasoning in medical question answering. The approach integrates page-level image embeddings (based on ColQwen2.5), a MapReduce-style chunked LLM filter, approximate nearest neighbor indexing, a vision-language model, and a memory bank to support multi-turn evidence accumulation. Evaluated on four medical QA benchmarks, MED-VRAG achieves an average accuracy of 78.6%, outperforming the no-retrieval baseline by 5.8 points and edging past the MedRAG + GPT-4 system (76.8%, a cross-paper rather than head-to-head comparison), while ablation studies confirm the contributions of image-based retrieval, iterative refinement, and the memory bank.
📝 Abstract
Medical retrieval-augmented generation (RAG) systems typically operate on text chunks extracted from biomedical literature, discarding the rich visual content (tables, figures, structured layouts) of original document pages. We propose MED-VRAG, an iterative multimodal RAG framework that retrieves and reasons over PMC document page images instead of OCR'd text. The system pairs ColQwen2.5 patch-level page embeddings with a sharded MapReduce LLM filter, scaling to ~350K pages while keeping Stage-1 retrieval under 30 ms via an offline coarse-to-fine index (C=8 centroids per page, ANN over centroids, exact two-way scoring on the top-R shortlist). A vision-language model (VLM) then iteratively refines its query and accumulates evidence in a memory bank across up to 3 reasoning rounds, with a single iteration costing ~15.9 s and the full three-round pipeline ~47.8 s on 4×A100. Across four medical QA benchmarks (MedQA, MedMCQA, PubMedQA, MMLU-Med), MED-VRAG reaches 78.6% average accuracy. Under controlled comparison with the same Qwen2.5-VL-32B backbone, retrieval contributes a +5.8 point gain over the no-retrieval baseline; we also note a +1.8 point edge over MedRAG + GPT-4 (76.8%), with the caveat that this is a cross-paper rather than head-to-head comparison. Ablations isolate +1.0 from page-image vs text-chunk retrieval, +1.5 from iteration, and +1.0 from the memory bank.
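The coarse-to-fine index described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation, and it rests on several assumptions: contiguous-chunk means stand in for whatever clustering produces the C centroids per page, a brute-force scan stands in for the ANN index, and one-directional late-interaction MaxSim stands in for the "exact two-way scoring", which the abstract does not specify. All function names (`page_centroids`, `maxsim`, `retrieve`) and the toy 2-D vectors are hypothetical.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def mean(vectors: List[Vector]) -> List[float]:
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def page_centroids(patches: List[Vector], c: int) -> List[List[float]]:
    """Offline step: summarize one page by at most c centroids.

    Assumption: means of contiguous patch chunks stand in for real clustering.
    """
    c = min(c, len(patches))
    size = -(-len(patches) // c)  # ceiling division
    return [mean(patches[i:i + size]) for i in range(0, len(patches), size)]


def maxsim(query: List[Vector], vectors: List[Vector]) -> float:
    """Late-interaction score: each query vector takes its best match, then sum."""
    return sum(max(dot(q, v) for v in vectors) for q in query)


def retrieve(
    query: List[Vector],
    pages: Dict[str, List[Vector]],
    centroids: Dict[str, List[List[float]]],
    r: int,
    k: int,
) -> List[str]:
    # Stage 1 (coarse): score pages on centroids only and keep the top-r.
    # Assumption: this brute-force scan replaces an ANN lookup over centroids.
    coarse = {pid: maxsim(query, cents) for pid, cents in centroids.items()}
    shortlist = sorted(coarse, key=coarse.get, reverse=True)[:r]
    # Stage 2 (fine): rescore only the shortlist against all patch embeddings.
    exact = {pid: maxsim(query, pages[pid]) for pid in shortlist}
    return sorted(exact, key=exact.get, reverse=True)[:k]


# Toy corpus: three "pages", four 2-D patch embeddings each.
pages = {
    "p1": [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]],
    "p2": [[0.0, 1.0], [0.0, 1.0], [0.1, 0.9], [0.0, 1.0]],
    "p3": [[-1.0, 0.0], [-1.0, 0.0], [0.0, -1.0], [0.0, -1.0]],
}
centroids = {pid: page_centroids(patches, c=2) for pid, patches in pages.items()}
query = [[1.0, 0.0], [0.0, 1.0]]

ranked = retrieve(query, pages, centroids, r=2, k=2)
print(ranked)
```

The point of the two stages is that the expensive patch-level scoring touches only the R shortlisted pages, while the first pass compares the query against C vectors per page rather than every patch.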