MedFactEval and MedAgentBrief: A Framework and Workflow for Generating and Evaluating Factual Clinical Summaries

📅 2025-09-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Evaluating the factual accuracy of clinical large language model (LLM) outputs is bottlenecked by time-intensive expert review, which does not scale to continuous quality assurance. This paper proposes a scalable approach to both factuality assessment and generation in clinical settings. First, MedFactEval has clinicians define high-salience key facts and uses an "LLM Jury", a majority vote across multiple LLMs, to assess whether each fact is included in a generated summary. Second, MedAgentBrief is a model-agnostic, multi-step workflow for generating factual discharge summaries. Third, measured against a gold-standard seven-physician majority vote, the LLM Jury reaches a Cohen's kappa of 81%, statistically non-inferior to a single human expert (kappa = 67%, P < 0.001). Together, the two contributions offer a physician-validated way to check and improve the factual reliability of generative AI outputs in healthcare.

📝 Abstract

Evaluating factual accuracy in Large Language Model (LLM)-generated clinical text is a critical barrier to adoption, as expert review is unscalable for the continuous quality assurance these systems require. We address this challenge with two complementary contributions. First, we introduce MedFactEval, a framework for scalable, fact-grounded evaluation where clinicians define high-salience key facts and an "LLM Jury" (a multi-LLM majority vote) assesses their inclusion in generated summaries. Second, we present MedAgentBrief, a model-agnostic, multi-step workflow designed to generate high-quality, factual discharge summaries. To validate our evaluation framework, we established a gold-standard reference using a seven-physician majority vote on clinician-defined key facts from inpatient cases. The MedFactEval LLM Jury achieved almost perfect agreement with this panel (Cohen's kappa = 81%), a performance statistically non-inferior to that of a single human expert (kappa = 67%, P < 0.001). Our work provides both a robust evaluation framework (MedFactEval) and a high-performing generation workflow (MedAgentBrief), offering a comprehensive approach to advance the responsible deployment of generative AI in clinical workflows.
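The agreement figures in the abstract are Cohen's kappa values, which correct raw agreement for the agreement two raters would reach by chance. The sketch below shows how kappa could be computed for two sets of per-fact labels. The function name `cohens_kappa` and the example labels are illustrative assumptions, not data or code from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: for each label, the product of the two raters'
    # marginal frequencies, summed over all labels.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined,
        # so this sketch reports perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels: 1 = key fact included in the summary, 0 = not included.
panel_labels = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]  # made-up physician-panel votes
jury_labels = [1, 1, 1, 0, 0, 1, 0, 1, 0, 0]   # made-up LLM Jury votes
kappa = cohens_kappa(panel_labels, jury_labels)
```

In this made-up example the raters agree on 9 of 10 facts (observed agreement 0.9) while chance agreement is 0.5, so kappa is (0.9 - 0.5) / (1 - 0.5), or about 0.8.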

Problem

Research questions and friction points this paper is trying to address.

Evaluating factual accuracy in LLM-generated clinical text

Expert review that does not scale to continuous quality assurance

Generating high-quality factual clinical discharge summaries

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-LLM jury system for scalable clinical fact evaluation

Model-agnostic workflow for generating factual discharge summaries

Framework combining automated evaluation with physician-validated benchmarks
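The multi-LLM jury described in the abstract can be sketched as a majority vote over independent judges, each answering whether a clinician-defined key fact is included in a summary. In this minimal sketch every name (`Juror`, `jury_verdict`, `fact_coverage`, and the stub jurors) is hypothetical; the stubs use simple string matching in place of real LLM calls so the example runs offline, and treating a tie as "not included" is an assumption rather than the paper's rule.

```python
from typing import Callable, Dict, Sequence

# A juror takes (summary, key_fact) and returns True if it judges the fact
# to be included. In practice each juror would wrap a different LLM.
Juror = Callable[[str, str], bool]


def jury_verdict(summary: str, key_fact: str, jurors: Sequence[Juror]) -> bool:
    """Return True when a strict majority of jurors judge the fact included."""
    if not jurors:
        raise ValueError("at least one juror is required")
    votes = [juror(summary, key_fact) for juror in jurors]
    # Strict majority: a tie (possible with an even jury) counts as not included.
    return sum(votes) > len(votes) / 2


def fact_coverage(
    summary: str, key_facts: Sequence[str], jurors: Sequence[Juror]
) -> Dict[str, bool]:
    """Map each key fact to the jury's verdict for the given summary."""
    return {fact: jury_verdict(summary, fact, jurors) for fact in key_facts}


# Stub jurors standing in for LLM judges (assumptions for illustration only).
def substring_juror(summary: str, key_fact: str) -> bool:
    return key_fact.lower() in summary.lower()


def all_words_juror(summary: str, key_fact: str) -> bool:
    text = summary.lower()
    return all(word in text for word in key_fact.lower().split())


def skeptical_juror(summary: str, key_fact: str) -> bool:
    return False  # always votes "not included"


summary = "Patient started on apixaban for atrial fibrillation."
key_facts = ["started on apixaban", "follow-up in two weeks"]
jurors = [substring_juror, all_words_juror, skeptical_juror]
coverage = fact_coverage(summary, key_facts, jurors)
```

Here the first fact receives two of three votes and is marked included, while the second receives none. An odd number of jurors avoids ties altogether.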

🔎 Similar Papers

Factual consistency evaluation of summarization in the era of large language models