🤖 AI Summary
Evaluating the factual accuracy of clinical large language model (LLM) outputs faces a fundamental scalability bottleneck because it relies on time-intensive expert review. To address this, we propose a scalable framework for factuality assessment and generation in clinical settings. First, we introduce MedFactEval, in which clinicians define high-salience key facts and an "LLM Jury" (a majority vote across multiple LLMs) checks whether each fact appears in a generated summary. Second, we present MedAgentBrief, a model-agnostic, multi-step workflow for generating factual discharge summaries. Third, against a gold-standard reference built from a seven-physician majority vote, the LLM Jury reaches almost perfect agreement with the panel (Cohen's kappa = 81%), statistically non-inferior to a single human expert (kappa = 67%, P < 0.001). Together, these contributions offer a validated, scalable approach to quality assurance for generative AI deployment in healthcare.
📝 Abstract
Evaluating factual accuracy in Large Language Model (LLM)-generated clinical text is a critical barrier to adoption, as expert review is unscalable for the continuous quality assurance these systems require. We address this challenge with two complementary contributions. First, we introduce MedFactEval, a framework for scalable, fact-grounded evaluation where clinicians define high-salience key facts and an "LLM Jury"--a multi-LLM majority vote--assesses their inclusion in generated summaries. Second, we present MedAgentBrief, a model-agnostic, multi-step workflow designed to generate high-quality, factual discharge summaries. To validate our evaluation framework, we established a gold-standard reference using a seven-physician majority vote on clinician-defined key facts from inpatient cases. The MedFactEval LLM Jury achieved almost perfect agreement with this panel (Cohen's kappa=81%), a performance statistically non-inferior to that of a single human expert (kappa=67%, P < 0.001). Our work provides both a robust evaluation framework (MedFactEval) and a high-performing generation workflow (MedAgentBrief), offering a comprehensive approach to advance the responsible deployment of generative AI in clinical workflows.
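The two mechanisms the abstract relies on, a majority vote over juror verdicts and Cohen's kappa as the agreement statistic, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `majority_vote` and `cohens_kappa`, the choice of three jurors, and all verdicts below are hypothetical, made up purely for illustration. Each verdict is assumed to be a boolean answering "is this key fact present in the summary?".

```python
from collections import Counter
from typing import Sequence


def majority_vote(votes: Sequence[bool]) -> bool:
    """Return True when strictly more than half of the jurors say the fact is present."""
    if not votes:
        raise ValueError("at least one vote is required")
    return sum(votes) * 2 > len(votes)


def cohens_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Cohen's kappa for two raters labelling the same items with True/False."""
    if len(a) != len(b) or not a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(a)
    # Observed agreement: fraction of items on which the two raters match.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in (True, False)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts: one row per key fact, one column per LLM juror.
jury_votes = [
    [True, True, False],
    [False, False, True],
    [True, True, True],
    [False, True, False],
    [True, False, True],
    [False, False, False],
]
# Hypothetical reference labels (e.g. a physician-panel majority) for the same facts.
panel_labels = [True, False, True, True, True, False]

jury_labels = [majority_vote(votes) for votes in jury_votes]
kappa = cohens_kappa(jury_labels, panel_labels)
print(f"jury labels: {jury_labels}")
print(f"kappa vs. panel: {kappa:.2f}")
```

With an odd number of jurors the strict-majority rule never ties; an even-sized jury would need an explicit tie-breaking policy, which the abstract does not specify.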