HeteroRAG: A Heterogeneous Retrieval-Augmented Generation Framework for Medical Vision Language Tasks

📅 2025-08-18
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Medical large vision-language models (Med-LVLMs) suffer from factual inaccuracies and unreliable outputs in clinical deployment, largely because existing retrieval-augmented generation (RAG) systems cannot effectively retrieve and fuse multimodal knowledge across heterogeneous sources such as imaging reports, biomedical literature, and clinical guidelines. To address this, the authors construct MedAtlas, a resource combining multimodal report repositories with diverse text corpora, and propose HeteroRAG, a heterogeneous RAG framework featuring: (1) modality-specific CLIP retrievers for precise cross-modal report retrieval; (2) a multi-corpora query generator that dynamically constructs queries for diverse textual knowledge sources; and (3) heterogeneous knowledge preference tuning, which aligns the Med-LVLM with knowledge drawn from disparate sources. Evaluated across 12 benchmarks covering 3 modalities, HeteroRAG significantly improves factual accuracy and the reliability of clinical decision-making, achieving state-of-the-art performance on most of them.

πŸ“ Abstract
Medical large vision-language models (Med-LVLMs) have shown promise in clinical applications but suffer from factual inaccuracies and unreliable outputs, posing risks in real-world diagnostics. While retrieval-augmented generation has emerged as a potential solution, current medical multimodal RAG systems are unable to perform effective retrieval across heterogeneous sources. The irrelevance of retrieved reports affects the factuality of analysis, while insufficient knowledge affects the credibility of clinical decision-making. To bridge the gap, we construct MedAtlas, which includes extensive multimodal report repositories and diverse text corpora. Building on it, we present HeteroRAG, a novel framework that enhances Med-LVLMs through heterogeneous knowledge sources. The framework introduces Modality-specific CLIPs for effective report retrieval and a Multi-corpora Query Generator for dynamically constructing queries for diverse corpora. Incorporating knowledge from such multifaceted sources, the Med-LVLM is then trained with Heterogeneous Knowledge Preference Tuning to achieve cross-modality and multi-source knowledge alignment. Extensive experiments across 12 datasets and 3 modalities demonstrate that the proposed HeteroRAG achieves state-of-the-art performance on most medical vision-language benchmarks, significantly improving the factual accuracy and reliability of Med-LVLMs.
Problem

Research questions and friction points this paper is trying to address.

Addresses factual inaccuracies in medical vision-language models
Enhances retrieval across heterogeneous medical data sources
Improves reliability of clinical decision-making outputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modality-specific CLIPs for effective report retrieval
Multi-corpora Query Generator for diverse corpora
Heterogeneous Knowledge Preference Tuning for alignment
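The retrieval side of these contributions can be pictured as routing per-corpus queries to separate dense indices and pooling the hits. The sketch below is a minimal toy illustration of that idea, not the paper's implementation: the embeddings, corpus names, and the `heterogeneous_retrieve` helper are all hypothetical, and real systems would use learned CLIP-style encoders and an ANN index rather than brute-force cosine search.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

class CorpusIndex:
    """Toy dense index over one knowledge source (e.g. reports or guidelines)."""
    def __init__(self, name, docs):
        self.name = name
        self.docs = docs  # list of (embedding, text) pairs

    def search(self, query_vec, k=1):
        # Rank documents by similarity to the (corpus-specific) query embedding.
        ranked = sorted(self.docs, key=lambda d: cosine(query_vec, d[0]), reverse=True)
        return [(self.name, text) for _, text in ranked[:k]]

def heterogeneous_retrieve(queries, indices, k=1):
    """Route each corpus-specific query to its own index and pool the results."""
    results = []
    for corpus, qvec in queries.items():
        results.extend(indices[corpus].search(qvec, k))
    return results

if __name__ == "__main__":
    reports = CorpusIndex("reports", [([1, 0], "normal chest x-ray"),
                                      ([0, 1], "pleural effusion")])
    literature = CorpusIndex("literature", [([1, 1], "effusion management study")])
    hits = heterogeneous_retrieve(
        {"reports": [0.1, 0.9], "literature": [1, 1]},
        {"reports": reports, "literature": literature},
    )
    print(hits)
```

In the paper, the analogous routing is done with learned modality-specific retrievers and generated queries; the pooled evidence then conditions the Med-LVLM during preference tuning.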