VeriFact: Verifying Facts in LLM-Generated Clinical Text with Electronic Health Records

📅 2025-01-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the lack of automated methods for verifying the factual accuracy of clinical text generated by large language models (LLMs), this paper proposes VeriFact, a system that combines retrieval-augmented generation (RAG) with the LLM-as-a-Judge paradigm to check whether generated text is supported by a patient's electronic health record (EHR). The paper also introduces VeriFact-BHC, a dataset that decomposes Brief Hospital Course narratives from discharge summaries into simple statements, each annotated by clinicians for whether the patient's EHR clinical notes support it. On VeriFact-BHC, VeriFact reaches up to 92.7% agreement with a denoised and adjudicated clinician ground truth, compared with a highest inter-clinician agreement of 88.5%. The authors suggest this could remove a current evaluation bottleneck for LLM-based EHR applications.

📝 Abstract
Methods to ensure factual accuracy of text generated by large language models (LLMs) in clinical medicine are lacking. VeriFact is an artificial intelligence system that combines retrieval-augmented generation and LLM-as-a-Judge to verify whether LLM-generated text is factually supported by a patient's medical history based on their electronic health record (EHR). To evaluate this system, we introduce VeriFact-BHC, a new dataset that decomposes Brief Hospital Course narratives from discharge summaries into a set of simple statements with clinician annotations for whether each statement is supported by the patient's EHR clinical notes. Whereas highest agreement between clinicians was 88.5%, VeriFact achieves up to 92.7% agreement when compared to a denoised and adjudicated average human clinician ground truth, suggesting that VeriFact exceeds the average clinician's ability to fact-check text against a patient's medical record. VeriFact may accelerate the development of LLM-based EHR applications by removing current evaluation bottlenecks.
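
The flow the abstract describes (decompose text into simple statements, retrieve EHR evidence for each, ask a judge for a verdict, then compare verdicts with clinician labels) can be illustrated with a minimal Python sketch. This is not the authors' implementation. Every name here is hypothetical, the two-label set is an assumption, and the token-overlap retriever and rule-based judge are toy stand-ins for the retrieval model and LLM judge a real system would use.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class Verdict:
    statement: str
    label: str            # hypothetical label set: "Supported" / "Not Supported"
    evidence: list[str]   # EHR notes shown to the judge


def tokenize(text: str) -> set[str]:
    """Lowercased alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def decompose(narrative: str) -> list[str]:
    """Naive sentence split; a stand-in for LLM-based statement decomposition."""
    parts = re.split(r"(?<=[.!?])\s+", narrative)
    return [p.strip() for p in parts if p.strip()]


def retrieve(statement: str, ehr_notes: list[str], k: int = 2) -> list[str]:
    """Rank notes by token overlap; a stand-in for a real retriever."""
    query = tokenize(statement)
    ranked = sorted(ehr_notes, key=lambda n: len(query & tokenize(n)), reverse=True)
    return ranked[:k]


def toy_judge(statement: str, evidence: list[str]) -> str:
    """Stand-in for an LLM judge: supported only if one note contains every token."""
    query = tokenize(statement)
    if any(query <= tokenize(note) for note in evidence):
        return "Supported"
    return "Not Supported"


def verify(narrative: str, ehr_notes: list[str], judge=toy_judge, k: int = 2) -> list[Verdict]:
    """Decompose, retrieve evidence per statement, then judge each statement."""
    verdicts = []
    for statement in decompose(narrative):
        evidence = retrieve(statement, ehr_notes, k)
        verdicts.append(Verdict(statement, judge(statement, evidence), evidence))
    return verdicts


def percent_agreement(labels_a: list[str], labels_b: list[str]) -> float:
    """Fraction of statements on which two raters assign the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Made-up example data, for illustration only.
notes = [
    "Patient admitted with pneumonia and started on ceftriaxone.",
    "Chest x-ray showed right lower lobe infiltrate.",
]
narrative = "Patient admitted with pneumonia. Patient started on vancomycin."
system_labels = [v.label for v in verify(narrative, notes)]
clinician_labels = ["Supported", "Not Supported"]  # hypothetical annotations
agreement = percent_agreement(system_labels, clinician_labels)
```

Passing the judge in as a parameter keeps the retrieval and scoring logic unchanged when the toy rule is swapped for a call to an actual model.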

Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Medical Article Generation
Electronic Health Records Accuracy

Innovation

Methods, ideas, or system contributions that make the work stand out.

VeriFact
VeriFact-BHC
EHR processing

Authors
Philip Chung
Department of Anesthesiology, Perioperative and Pain Medicine, Stanford Medicine
Akshay Swaminathan
Department of Biomedical Data Science, Stanford Medicine
Alex J. Goodell
Department of Anesthesiology, Perioperative and Pain Medicine, Stanford Medicine
Yeasul Kim
Department of Anesthesiology, Perioperative and Pain Medicine, Stanford Medicine
S. M. Reincke
Department of Anesthesiology, Perioperative and Pain Medicine, Stanford Medicine; Department of Biomedical Data Science, Stanford Medicine; Department of Pediatrics, Stanford Medicine
Lichy Han
Department of Anesthesiology, Perioperative and Pain Medicine, Stanford Medicine
Ben Deverett
Department of Anesthesiology, Perioperative and Pain Medicine, Stanford Medicine
Mohammad Amin Sadeghi
Qatar Computing Research Institute
Machine Learning, Computer Vision
Abdel-Badih Ariss
Department of Anesthesiology, Perioperative and Pain Medicine, Stanford Medicine; Department of Biomedical Data Science, Stanford Medicine; Department of Pediatrics, Stanford Medicine
Marc Ghanem
Department of Anesthesiology, Perioperative and Pain Medicine, Stanford Medicine
David Seong
Department of Anesthesiology, Perioperative and Pain Medicine, Stanford Medicine; Immunology Program, Stanford Medicine; Medical Scientist Training Program, Stanford Medicine
Andrew A. Lee
Department of Anesthesiology, Perioperative and Pain Medicine, Stanford Medicine
Caitlin E. Coombes
Department of Anesthesiology, Perioperative and Pain Medicine, Stanford Medicine
Brad Bradshaw
Department of Anesthesiology, Perioperative and Pain Medicine, Stanford Medicine
Mahir A. Sufian
Department of Anesthesiology, Perioperative and Pain Medicine, Stanford Medicine
Hyo Jung Hong
Department of Anesthesiology, Perioperative and Pain Medicine, Stanford Medicine
Teresa P. Nguyen
Department of Anesthesiology, Perioperative and Pain Medicine, Stanford Medicine
Mohammad R. Rasouli
Department of Anesthesiology, Perioperative and Pain Medicine, Stanford Medicine
Komal Kamra
Department of Anesthesiology, Perioperative and Pain Medicine, Stanford Medicine
Mark A. Burbridge
Department of Anesthesiology, Perioperative and Pain Medicine, Stanford Medicine
James C. McAvoy
Department of Anesthesiology, Perioperative and Pain Medicine, Stanford Medicine
Roya Saffary
Department of Anesthesiology, Perioperative and Pain Medicine, Stanford Medicine
Stephen P. Ma
Department of Medicine, Stanford Medicine
Dev Dash
Department of Emergency Medicine, Stanford Medicine
James Xie
Department of Anesthesiology, Perioperative and Pain Medicine, Stanford Medicine
Ellen Y. Wang
Department of Anesthesiology, Perioperative and Pain Medicine, Stanford Medicine
Clifford A. Schmiesing
Department of Anesthesiology, Perioperative and Pain Medicine, Stanford Medicine
Nigam Shah
Professor of Medicine, and Biomedical Data Science, Stanford University
ontology, data mining, medical informatics, Biomedical Informatics
Nima Aghaeepour
Stanford University
Machine Learning, Artificial Intelligence, Systems Immunology, Data Integration, Wearable Devices