Development and Validation of the Provider Documentation Summarization Quality Instrument for Large Language Models

📅 2025-01-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
The absence of reliable, standardized quality assessment tools hinders clinical deployment of large language models (LLMs) for electronic health record (EHR) summarization. Method: We developed and validated PDSQI-9—the first standardized scale specifically designed to evaluate LLM-generated clinical summaries—based on multi-specialty real-world EHR data and outputs from GPT-4o, Mixtral, and Llama 3. The scale operationalizes a four-dimensional framework: organization, clarity, accuracy, and clinical utility. Rigorous validation included content, structural, criterion-related, discriminant, and generalizability validity assessments, conducted via semi-Delphi consensus, exploratory factor analysis, and multi-index reliability testing (Cronbach’s α = 0.879; ICC = 0.867). Results: PDSQI-9 demonstrates high psychometric validity and reliability, with statistically significant intergroup discrimination (p < 0.001). It provides a reproducible, generalizable benchmark for evaluating LLM-generated clinical summaries in real-world healthcare settings.

📝 Abstract
As Large Language Models (LLMs) are integrated into electronic health record (EHR) workflows, validated instruments are essential to evaluate their performance before implementation. Existing instruments for provider documentation quality are often unsuitable for the complexities of LLM-generated text and lack validation on real-world data. The Provider Documentation Summarization Quality Instrument (PDSQI-9) was developed to evaluate LLM-generated clinical summaries. Multi-document summaries were generated from real-world EHR data across multiple specialties using several LLMs (GPT-4o, Mixtral 8x7b, and Llama 3-8b). Validation included Pearson correlation for substantive validity, factor analysis and Cronbach's alpha for structural validity, inter-rater reliability (ICC and Krippendorff's alpha) for generalizability, a semi-Delphi process for content validity, and comparisons of high- versus low-quality summaries for discriminant validity. Seven physician raters evaluated 779 summaries and answered 8,329 questions, achieving over 80% power for inter-rater reliability. The PDSQI-9 demonstrated strong internal consistency (Cronbach's alpha = 0.879; 95% CI: 0.867-0.891) and high inter-rater reliability (ICC = 0.867; 95% CI: 0.867-0.868), supporting structural validity and generalizability. Factor analysis identified a 4-factor model explaining 58% of the variance, representing organization, clarity, accuracy, and utility. Substantive validity was supported by correlations between note length and scores for Succinct (rho = -0.200, p = 0.029) and Organized (rho = -0.190, p = 0.037). Discriminant validity distinguished high- from low-quality summaries (p < 0.001). The PDSQI-9 demonstrates robust construct validity, supporting its use in clinical practice to evaluate LLM-generated summaries and facilitate safer integration of LLMs into healthcare workflows.
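The internal-consistency statistic reported above (Cronbach's alpha = 0.879) can be computed directly from a raters-by-items score matrix. The sketch below is a minimal illustration, not the authors' code; the score values are hypothetical and stand in for instrument-item ratings of individual summaries.

```python
import numpy as np

def cronbach_alpha(ratings):
    """Cronbach's alpha for an (n_subjects x k_items) score matrix.

    alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
    """
    ratings = np.asarray(ratings, dtype=float)
    n_subjects, k = ratings.shape
    item_vars = ratings.var(axis=0, ddof=1)      # sample variance of each item
    total_var = ratings.sum(axis=1).var(ddof=1)  # variance of per-subject totals
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Hypothetical example: 5 summaries scored on 3 instrument items
scores = [
    [4, 5, 4],
    [3, 3, 2],
    [5, 5, 5],
    [2, 3, 2],
    [4, 4, 5],
]
print(round(cronbach_alpha(scores), 3))  # prints 0.934
```

Values near 1 indicate that the items move together across subjects; perfectly correlated items yield alpha = 1.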
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Medical Records
Quality Evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

PDSQI-9
Medical Summary Quality Assessment
Large Language Model Evaluation
Emma Croxford
PhD Student, University of Wisconsin - Madison
Evaluation, Clinical Natural Language Generation, Large Language Models
Yanjun Gao
University of Colorado; University of Wisconsin Madison
Natural Language Processing, Artificial Intelligence, Health Informatics, Education Technology
Nicholas Pellegrino
Nicholas Pellegrino
Doctoral Candidate, Systems Design Engineering, University of Waterloo
Karen K. Wong
UW Health, Madison, WI; Epic Systems, Verona, WI
Graham Wills
UW Health, Madison, WI
Elliot First
Epic Systems, Verona, WI
Miranda Schnier
Epic Systems, Verona, WI
Kyle Burton
UW Health, Madison, WI
Cris G. Ebby
Department of Pediatrics, University of Wisconsin, Madison
Jillian Gorskic
UW Health, Madison, WI
Matthew Kalscheur
UW Health, Madison, WI; Department of Medicine, University of Wisconsin, Madison
Samy Khalil
UW Health, Madison, WI
Marie Pisani
Department of Medicine, University of Wisconsin, Madison
Tyler Rubeor
UW Health, Madison, WI
Peter Stetson
Memorial Sloan Kettering Cancer Center, New York, NY
Frank Liao
BerbeeWalsh Department of Emergency Medicine, University of Wisconsin, Madison; UW Health, Madison, WI
Cherodeep Goswami
UW Health, Madison, WI
Brian Patterson
BerbeeWalsh Department of Emergency Medicine, University of Wisconsin, Madison; UW Health, Madison, WI
Majid Afshar
University of Wisconsin - Madison
Natural Language Processing, Artificial Intelligence, Med Informatics, Critical Care, UW-Informatics