Rethinking Retrieval-Augmented Generation for Medicine: A Large-Scale, Systematic Expert Evaluation and Practical Insights

📅 2025-11-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Medical large language models (LLMs) face two critical challenges: knowledge obsolescence and unverifiable reasoning. While retrieval-augmented generation (RAG) is widely adopted to address both, its reliability in clinical evidence retrieval, filtering, and response generation remains poorly characterized. Through the first large-scale expert annotation effort of its kind, this study systematically exposes RAG's performance degradation in medical settings: only 22% of retrieved documents are clinically relevant, evidence selection precision is only 41-43%, and response factuality declines by up to 6%. To address this, the authors propose a stage-aware evaluation framework and deployable optimizations, including evidence filtering and query reformulation. Evaluated with GPT-4o and Llama-3.1-8B, these strategies improve accuracy on MedMCQA and MedXpertQA by up to 12% and 8.2%, respectively, substantially enhancing RAG's trustworthiness and practical utility in evidence-based medicine.

📝 Abstract
Large language models (LLMs) are transforming the landscape of medicine, yet two fundamental challenges persist: keeping up with rapidly evolving medical knowledge and providing verifiable, evidence-grounded reasoning. Retrieval-augmented generation (RAG) has been widely adopted to address these limitations by supplementing model outputs with retrieved evidence. However, whether RAG reliably achieves these goals remains unclear. Here, we present the most comprehensive expert evaluation of RAG in medicine to date. Eighteen medical experts contributed a total of 80,502 annotations, assessing 800 model outputs generated by GPT-4o and Llama-3.1-8B across 200 real-world patient and USMLE-style queries. We systematically decomposed the RAG pipeline into three components: (i) evidence retrieval (relevance of retrieved passages), (ii) evidence selection (accuracy of evidence usage), and (iii) response generation (factuality and completeness of outputs). Contrary to expectation, standard RAG often degraded performance: only 22% of top-16 passages were relevant, evidence selection remained weak (precision 41-43%, recall 27-49%), and factuality and completeness dropped by up to 6% and 5%, respectively, compared with non-RAG variants. Retrieval and evidence selection remain key failure points for the model, contributing to the overall performance drop. We further show that simple yet effective strategies, including evidence filtering and query reformulation, substantially mitigate these issues, improving performance on MedMCQA and MedXpertQA by up to 12% and 8.2%, respectively. These findings call for re-examining RAG's role in medicine and highlight the importance of stage-aware evaluation and deliberate system design for reliable medical LLM applications.
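The stage-wise numbers above suggest a simple way to operationalize the paper's stage-aware evaluation: score retrieval, evidence selection, and generation separately against expert annotations instead of judging only the final answer. Below is a minimal Python sketch of such metrics; the data structure and function names are illustrative assumptions, not the authors' released code.

```python
# Stage-aware scoring sketch: evaluate each RAG stage against expert labels.
# All names here are hypothetical; the paper's annotation schema is richer.
from dataclasses import dataclass

@dataclass
class StageAnnotation:
    retrieved_ids: list[str]  # passage IDs returned by the retriever (e.g., top-16)
    relevant_ids: set[str]    # passages experts judged clinically relevant
    used_ids: set[str]        # passages the model actually drew on in its answer

def retrieval_relevance(a: StageAnnotation) -> float:
    """Share of retrieved passages that are relevant (the paper reports ~22%)."""
    if not a.retrieved_ids:
        return 0.0
    return sum(pid in a.relevant_ids for pid in a.retrieved_ids) / len(a.retrieved_ids)

def selection_precision(a: StageAnnotation) -> float:
    """Of the passages the model used, the relevant share (reported 41-43%)."""
    return len(a.used_ids & a.relevant_ids) / len(a.used_ids) if a.used_ids else 0.0

def selection_recall(a: StageAnnotation) -> float:
    """Of the relevant passages available, the share the model used (reported 27-49%)."""
    return len(a.used_ids & a.relevant_ids) / len(a.relevant_ids) if a.relevant_ids else 0.0
```

Keeping the three scores separate makes the failure mode legible: low retrieval relevance with decent selection precision points at the retriever, while the reverse points at how the model uses the evidence it is given.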
Problem

Research questions and friction points this paper is trying to address.

Evaluating RAG reliability in medical applications through expert assessment
Addressing performance degradation caused by poor evidence retrieval and selection
Developing strategies to improve medical LLM factuality and completeness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematically decomposing the RAG pipeline into three components: evidence retrieval, evidence selection, and response generation
Using evidence filtering to discard irrelevant retrieved passages before generation
Applying query reformulation to improve retrieval and downstream evidence selection (a sketch of both strategies follows this list)
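A hedged sketch of how the two strategies named above could be wired into a RAG loop. The `retrieve` and `llm` callables and the prompts are assumptions for illustration; the paper's exact prompts, retriever, and thresholds are not reproduced here.

```python
# Illustrative RAG loop with query reformulation and evidence filtering.
# `llm` is any text-in/text-out model call; `retrieve` is any search backend.
from typing import Callable

def reformulate_query(question: str, llm: Callable[[str], str]) -> str:
    """Rewrite a clinical question into a concise, retrieval-friendly query."""
    return llm("Rewrite this medical question as a short search query, "
               f"keeping the key clinical entities:\n{question}")

def filter_evidence(question: str, passages: list[str],
                    llm: Callable[[str], str]) -> list[str]:
    """Keep only passages the model judges clinically relevant to the question."""
    kept = []
    for passage in passages:
        verdict = llm(f"Question: {question}\nPassage: {passage}\n"
                      "Is this passage clinically relevant to the question? "
                      "Answer yes or no.")
        if verdict.strip().lower().startswith("yes"):
            kept.append(passage)
    return kept

def answer_with_rag(question: str,
                    retrieve: Callable[[str], list[str]],
                    llm: Callable[[str], str]) -> str:
    """Pipeline: reformulate -> retrieve -> filter -> generate."""
    evidence = filter_evidence(question, retrieve(reformulate_query(question, llm)), llm)
    context = "\n\n".join(evidence) if evidence else "(no relevant evidence found)"
    return llm(f"Evidence:\n{context}\n\nQuestion: {question}\nAnswer concisely:")
```

The filtering step directly targets the 22% relevance problem: irrelevant passages are dropped before generation rather than left for the model to ignore.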
👥 Authors
Hyunjae Kim
Yale University
Natural Language Processing · Biomedical Informatics · Healthcare
Jiwoong Sohn
Department of Biosystems Science and Engineering, ETH Zurich, Zurich, Switzerland
Aidan Gilson
Massachusetts Eye and Ear, Harvard Medical School
Ophthalmology · Machine Learning · Artificial Intelligence
Nicholas C Cochran-Caggiano
Geisel School of Medicine at Dartmouth, Hanover, NH, USA
Serina S Applebaum
Yale School of Medicine, Yale University, New Haven, CT, USA
Heeju Jin
Seoul National University College of Medicine, Seoul, Republic of Korea
Seihee Park
Seoul National University College of Medicine, Seoul, Republic of Korea
Yujin Park
Georgia Southern University
Online Learning · Teacher Professional Learning · Elementary STEM · Digital Literacy · OER
Jiyeong Park
Seoul National University College of Medicine, Seoul, Republic of Korea
Seoyoung Choi
Seoul National University College of Medicine, Seoul, Republic of Korea
Brittany Alexandra Herrera Contreras
Yale School of Medicine, Yale University, New Haven, CT, USA
Thomas Huang
Yale School of Medicine, Yale University, New Haven, CT, USA
J. Yun
Hanyang University College of Medicine, Seoul, Republic of Korea
Ethan F. Wei
Yale School of Medicine, Yale University, New Haven, CT, USA
Roy Jiang
Yale School of Medicine, Yale University, New Haven, CT, USA
Leah Colucci
Yale School of Medicine, Yale University, New Haven, CT, USA
Eric Lai
Yale School of Medicine, Yale University, New Haven, CT, USA
Amisha D. Dave
Yale School of Medicine, Yale University, New Haven, CT, USA
Tuo Guo
Yale School of Medicine, Yale University, New Haven, CT, USA
Maxwell B. Singer
Yale School of Medicine, Yale University, New Haven, CT, USA
Yonghoe Koo
Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea
Ron A. Adelman
Yale School of Medicine, Yale University, New Haven, CT, USA
James Zou
Stanford University
Machine learning · computational biology · computational health · statistics · biotech
Andrew Taylor
University of Virginia School of Medicine, Charlottesville, VA, USA
Arman Cohan
Yale University; Allen Institute for AI
Natural Language Processing · Machine Learning · Artificial Intelligence
Hua Xu
Yale School of Medicine, Yale University, New Haven, CT, USA
Qingyu Chen
Biomedical Informatics & Data Science, Yale University; NCBI-NLM, National Institutes of Health
Text mining · Machine learning · Data curation · BioNLP · Medical Imaging Analysis