MedFactEval and MedAgentBrief: A Framework and Workflow for Generating and Evaluating Factual Clinical Summaries

📅 2025-09-06
🤖 AI Summary
Evaluating the factual accuracy of clinical large language model (LLM) outputs faces a fundamental scalability bottleneck: it depends on time-intensive expert review. To address this, we propose a scalable framework for factuality assessment and generation in clinical settings. First, we introduce the "LLM Jury," an automated mechanism that uses multi-LLM majority voting to assess whether clinician-defined key facts appear in generated summaries. Second, we design MedAgentBrief, a model-agnostic, multi-step workflow for generating factual discharge summaries. Third, against a gold standard established by a seven-physician majority vote, the LLM Jury achieved almost perfect agreement (Cohen's kappa = 81%), statistically non-inferior to a single human expert (kappa = 67%, P < 0.001). Our framework improves both the factual reliability and the evaluation efficiency of clinical AI outputs, establishing a scalable, rigorously validated quality-assurance paradigm for deploying generative AI in healthcare.

📝 Abstract
Evaluating factual accuracy in Large Language Model (LLM)-generated clinical text is a critical barrier to adoption, as expert review is unscalable for the continuous quality assurance these systems require. We address this challenge with two complementary contributions. First, we introduce MedFactEval, a framework for scalable, fact-grounded evaluation where clinicians define high-salience key facts and an "LLM Jury"--a multi-LLM majority vote--assesses their inclusion in generated summaries. Second, we present MedAgentBrief, a model-agnostic, multi-step workflow designed to generate high-quality, factual discharge summaries. To validate our evaluation framework, we established a gold-standard reference using a seven-physician majority vote on clinician-defined key facts from inpatient cases. The MedFactEval LLM Jury achieved almost perfect agreement with this panel (Cohen's kappa=81%), a performance statistically non-inferior to that of a single human expert (kappa=67%, P < 0.001). Our work provides both a robust evaluation framework (MedFactEval) and a high-performing generation workflow (MedAgentBrief), offering a comprehensive approach to advance the responsible deployment of generative AI in clinical workflows.
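The agreement statistics quoted above (kappa = 81% for the jury, 67% for a single expert) use standard Cohen's kappa, which corrects raw agreement for chance. A minimal sketch of that computation for two binary raters; this is the textbook formula, not code from the paper:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters over the same items (binary labels)."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items where the raters match.
    p_o = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Expected chance agreement under independent marginal label rates.
    ca, cb = Counter(rater_a), Counter(rater_b)
    p_e = sum((ca[label] / n) * (cb[label] / n) for label in set(rater_a) | set(rater_b))
    return (p_o - p_e) / (1 - p_e)
```

For example, two raters who agree on 3 of 4 items with the marginals below yield kappa = 0.5, well under the raw 75% agreement, illustrating the chance correction.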
Problem

Research questions and friction points this paper is trying to address.

Evaluating factual accuracy in LLM-generated clinical text
Scalable expert review for continuous quality assurance
Generating high-quality factual clinical discharge summaries
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-LLM jury system for scalable clinical fact evaluation
Model-agnostic workflow for generating factual discharge summaries
Framework combining automated evaluation with physician-validated benchmarks
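The multi-LLM jury in the bullets above can be sketched as a simple aggregation scheme. Here the `judges` callables stand in for per-model fact checks (the paper's actual prompting and model choices are not specified here), so this is an illustrative sketch of majority voting over key facts, not the authors' implementation:

```python
def llm_jury_verdict(fact, summary, judges):
    """Majority vote over independent judges.

    Each judge is a callable (fact, summary) -> bool indicating whether
    it considers the clinician-defined key fact covered by the summary.
    """
    votes = [judge(fact, summary) for judge in judges]
    return sum(votes) > len(votes) / 2

def score_summary(key_facts, summary, judges):
    """Fraction of key facts the jury deems included in the summary."""
    included = [llm_jury_verdict(fact, summary, judges) for fact in key_facts]
    return sum(included) / len(included)
```

An odd number of judges avoids ties; with real LLM judges, each callable would wrap a model prompt asking whether the fact is supported by the summary.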
François Grolleau
Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
Emily Alsentzer
Assistant Professor, Stanford University
Timothy Keyes
Stanford Health Care, Palo Alto, CA, USA
Philip Chung
Department of Anesthesiology and Pain Medicine, Stanford Medicine, Stanford, CA, USA
Akshay Swaminathan
Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
Asad Aali
Department of Radiology, Stanford University, Stanford, CA, USA
Jason Hom
Department of Medicine, Stanford University, Stanford, CA, USA
Tridu Huynh
Department of Medicine, Stanford University, Stanford, CA, USA
Thomas Lew
Toyota Research Institute
April S. Liang
Department of Medicine, Stanford University, Stanford, CA, USA
Weihan Chu
Department of Medicine, Stanford University, Stanford, CA, USA
Natasha Z. Steele
Department of Medicine, Stanford University, Stanford, CA, USA
Christina F. Lin
Department of Medicine, Stanford University, Stanford, CA, USA
Jingkun Yang
Department of Medicine, Stanford University, Stanford, CA, USA
Kameron C. Black
Department of Medicine, Stanford University, Stanford, CA, USA
Stephen P. Ma
Department of Medicine, Stanford University, Stanford, CA, USA
Fateme N. Haredasht
Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
Nigam H. Shah
Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA; Department of Biomedical Data Science, Stanford University, Stanford, CA, USA; Department of Medicine, Stanford University, Stanford, CA, USA
Kevin Schulman
Stanford Clinical Excellence Research Center, Stanford University, Stanford, CA, USA; Department of Medicine, Stanford University, Stanford, CA, USA
Jonathan H. Chen
Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA; Stanford Clinical Excellence Research Center, Stanford University, Stanford, CA, USA; Department of Medicine, Stanford University, Stanford, CA, USA