M-Eval: A Heterogeneity-Based Framework for Multi-evidence Validation in Medical RAG Systems

📅 2025-10-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Medical RAG systems frequently suffer from factual inaccuracies caused by hallucinations and insufficient use of external knowledge. To address this, we propose M-Eval, a multi-evidence verification framework that adapts heterogeneity analysis from evidence-based medicine to RAG: it assesses both answer credibility and evidentiary reliability by evaluating consistency across multiple retrieved literature sources. Our method integrates external knowledge retrieval, multi-document evidence extraction, and LLM-driven response generation with joint verification. Experiments across various LLMs show an accuracy improvement of up to 23.31%, improving factual consistency in medical question answering and reducing the risk of diagnostic errors. The core contribution is an interpretable, verifiable mechanism for coordinating evidence, intended to make medical RAG systems more trustworthy.

📝 Abstract
Retrieval-augmented Generation (RAG) has demonstrated potential in enhancing medical question-answering systems through the integration of large language models (LLMs) with external medical literature. LLMs can retrieve relevant medical articles to generate more professional responses efficiently. However, current RAG applications still generate incorrect information, such as hallucinations, and fail to use external knowledge correctly. To solve these issues, we propose a new method named M-Eval, inspired by the heterogeneity analysis approach used in Evidence-Based Medicine (EBM). Our approach checks for factual errors in RAG responses using evidence from multiple sources. First, we extract additional medical literature from external knowledge bases. Then, we retrieve the evidence documents provided by the RAG system. We use heterogeneity analysis to check whether the evidence supports the different viewpoints in the response. In addition to verifying the accuracy of the response, we also assess the reliability of the evidence provided by the RAG system. Our method shows an improvement of up to 23.31% in accuracy across various LLMs. This work can help detect errors in current RAG-based medical systems, make LLM applications more reliable, and reduce diagnostic errors.
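
The abstract does not say which heterogeneity statistic M-Eval uses. In evidence-based medicine, heterogeneity across studies is commonly quantified with Cochran's Q and the I² statistic, so the sketch below illustrates that standard calculation only. The function name `cochran_q_i2` and the idea of feeding it per-document effect estimates are assumptions for illustration, not the authors' implementation.

```python
from typing import Sequence, Tuple


def cochran_q_i2(effects: Sequence[float],
                 variances: Sequence[float]) -> Tuple[float, float]:
    """Return (Q, I2 in percent) for a set of effect estimates.

    effects   -- one effect estimate per evidence source (assumed input)
    variances -- the sampling variance of each estimate (assumed input)
    """
    if len(effects) != len(variances) or len(effects) < 2:
        raise ValueError("need at least two effects with matching variances")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")

    # Inverse-variance weights: more precise sources count for more.
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)

    # Cochran's Q: weighted squared deviation from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1

    # I2: share of total variation attributed to heterogeneity, floored at 0.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, i2
```

A low I² suggests the sources broadly agree, while a high I² signals conflicting evidence that would warrant closer review of the response.
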
Problem

Research questions and friction points this paper is trying to address.

Detects factual errors in medical RAG system responses
Validates evidence reliability using multi-source heterogeneity analysis
Reduces hallucinations and diagnostic errors in medical QA
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses heterogeneity analysis to validate multiple evidence sources
Extracts external medical literature for cross-referencing responses
Assesses evidence reliability and accuracy in RAG systems
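
As a rough illustration of cross-referencing one claim against several evidence sources, the sketch below tallies per-source stances and returns a verdict. It is a simplified categorical agreement check, not the paper's algorithm: the `Evidence` type, the stance labels, and the `min_sources` and `agreement` thresholds are all hypothetical, and in practice the stance of each document would have to come from a separate model or annotator.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Evidence:
    source: str  # hypothetical identifier of a retrieved article
    stance: str  # "support", "refute", or "neutral" toward the claim


def verify_claim(evidence: List[Evidence],
                 min_sources: int = 2,
                 agreement: float = 0.75) -> str:
    """Classify a claim from the stances of its evidence (illustrative thresholds)."""
    counts = Counter(e.stance for e in evidence)
    # Neutral documents do not count toward the decision.
    decided = counts["support"] + counts["refute"]
    if decided < min_sources:
        return "insufficient"

    support_ratio = counts["support"] / decided
    if support_ratio >= agreement:
        return "supported"
    if support_ratio <= 1 - agreement:
        return "refuted"
    # Sources disagree: flag the claim rather than pick a side.
    return "conflicting"
```

Returning "conflicting" as its own outcome keeps disagreement between sources visible instead of hiding it behind a majority vote.
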