MIRA: A Novel Framework for Fusing Modalities in Medical RAG

📅 2025-07-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) suffer from factual inaccuracies in medical question answering and report generation, undermining clinical reliability. Existing retrieval-augmented generation (RAG) approaches face dual challenges: insufficient retrieval leads to missing critical information, while excessive retrieval introduces noise and disrupts the model’s intrinsic reasoning by overwhelming it with external knowledge. To address these issues, we propose a dynamic RAG framework featuring a “Re-think and Re-rank” module that adaptively determines the optimal number of retrieved contexts based on query semantics and image embeddings. Furthermore, we integrate query rewriting with multimodal embedding alignment to enable synergistic verification between the model’s internal knowledge and external medical resources. Evaluated on multiple public medical visual question answering and report generation benchmarks, our method significantly improves factual accuracy and generation quality, achieving state-of-the-art performance. The implementation is publicly available.

📝 Abstract
Multimodal Large Language Models (MLLMs) have significantly advanced AI-assisted medical diagnosis, but they often generate factually inconsistent responses that deviate from established medical knowledge. Retrieval-Augmented Generation (RAG) enhances factual accuracy by integrating external sources, but it presents two key challenges. First, insufficient retrieval can miss critical information, whereas excessive retrieval can introduce irrelevant or misleading content, disrupting model output. Second, even when the model initially provides correct answers, over-reliance on retrieved data can lead to factual errors. To address these issues, we introduce the Multimodal Intelligent Retrieval and Augmentation (MIRA) framework, designed to optimize factual accuracy in MLLMs. MIRA consists of two key components: (1) a calibrated Rethinking and Rearrangement module that dynamically adjusts the number of retrieved contexts to manage factual risk, and (2) a medical RAG framework integrating image embeddings and a medical knowledge base with a query-rewrite module for efficient multimodal reasoning. This enables the model to effectively integrate both its inherent knowledge and external references. Our evaluation on publicly available medical VQA and report generation benchmarks demonstrates that MIRA substantially enhances factual accuracy and overall performance, achieving new state-of-the-art results. Code is released at https://github.com/mbzuai-oryx/MIRA.
Problem

Research questions and friction points this paper is trying to address.

Addresses factual inconsistencies in medical MLLM responses
Optimizes retrieval balance to avoid irrelevant content
Reduces over-reliance on retrieved data causing errors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic retrieval adjustment for factual risk
Image embeddings with medical knowledge base
Query-rewrite module for multimodal reasoning
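
The core of the dynamic retrieval idea above is choosing how many contexts to keep per query rather than using a fixed top-k. The following is a minimal, hypothetical sketch of that adaptive selection step; the function names, threshold value, and scoring are illustrative assumptions, not MIRA's actual implementation.

```python
# Hypothetical sketch of adaptive context selection ("re-think and re-rank"):
# keep only retrieved contexts whose relevance to the fused query embedding
# clears a calibrated threshold, instead of always taking a fixed top-k.
# Names and numbers here are assumptions for illustration, not MIRA's code.

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def adaptive_rerank(query_emb, contexts, tau=0.75, max_k=5):
    """Score each (context, embedding) pair against the query embedding,
    sort by relevance, and keep at most `max_k` contexts that score at
    least `tau`. An empty result signals: rely on the model's internal
    knowledge and skip retrieval augmentation for this query."""
    scored = sorted(
        ((cosine(query_emb, emb), ctx) for ctx, emb in contexts),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return [ctx for score, ctx in scored[:max_k] if score >= tau]
```

Under this sketch, a high-confidence query with only weakly matching passages retrieves nothing, which is one way to avoid the "excessive retrieval introduces noise" failure mode the paper describes.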
Jinhong Wang
Department of Computer Vision, MBZUAI, Abu Dhabi, United Arab Emirates
Tajamul Ashraf
IIT Delhi, MBZUAI
Computer Vision · Deep Learning
Zongyan Han
Department of Computer Vision, MBZUAI, Abu Dhabi, United Arab Emirates
Jorma Laaksonen
Department of Computer Science, Aalto University, Aalto, Finland
Rao Mohammad Anwer
Department of Computer Vision, MBZUAI, Abu Dhabi, United Arab Emirates