On Using Large Language Models to Enhance Clinically-Driven Missing Data Recovery Algorithms in Electronic Health Records

📅 2025-10-04

🤖 AI Summary
Electronic health record (EHR) data are often missing or erroneous, and manual chart review is too labor-intensive to scale. This work studies a roadmap-driven algorithm that uses ICD-10 codes to mimic expert chart review: a "roadmap" of auxiliary diagnoses (anchors) indicates when a missing value, such as a laboratory result, can be inferred from a patient's recorded diagnoses. A large language model (LLM) proposes additional auxiliary diagnoses and clinicians review them iteratively, so only clinician-approved additions enter the final roadmap. Algorithm performance with different roadmaps was examined against chart reviews for 100 patients, and the final algorithm was then applied to the larger study of 1000 patients. Depending on the roadmap, the algorithm recovered as much missing data as the expert chart reviewers, if not more, and it can feasibly be applied to large samples.

📝 Abstract
Objective: Electronic health record (EHR) data are prone to missingness and errors. Previously, we devised an "enriched" chart review protocol where a "roadmap" of auxiliary diagnoses (anchors) was used to recover missing values in EHR data (e.g., a diagnosis of impaired glycemic control might imply that a missing hemoglobin A1c value would be considered unhealthy). Still, chart reviews are expensive and time-intensive, which limits the number of patients whose data can be reviewed. Now, we investigate the accuracy and scalability of a roadmap-driven algorithm, based on ICD-10 codes (International Classification of Diseases, 10th revision), to mimic expert chart reviews and recover missing values.

Materials and Methods: In addition to the clinicians' original roadmap from our previous work, we consider new versions that were iteratively refined using large language models (LLM) in conjunction with clinical expertise to expand the list of auxiliary diagnoses. Using chart reviews for 100 patients from the EHR at an extensive learning health system, we examine algorithm performance with different roadmaps. We then applied the final algorithm, which used a roadmap with clinician-approved additions from the LLM, to the larger study of 1000 patients.

Results: Depending on the roadmap, the algorithm recovered as much missing data as the expert chart reviewers, if not more.

Discussion: Clinically-driven algorithms (enhanced by LLM) can recover missing EHR data with similar accuracy to chart reviews and can feasibly be applied to large samples. Extending them to monitor other dimensions of data quality (e.g., plausibility) is a promising future direction.
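
To make the roadmap idea concrete, here is a minimal Python sketch of how an ICD-10-driven recovery rule of this kind might look. It is an illustration under stated assumptions, not the authors' implementation: the function name `recover_missing`, the `ROADMAP` structure, the "unhealthy" label, and the specific code prefixes (`R73`, `E11`) are all hypothetical choices made for the example, loosely following the abstract's hemoglobin A1c case.

```python
from typing import Iterable, Mapping, Optional, Tuple

# Hypothetical roadmap: component name -> ICD-10 code prefixes used as anchors.
# The prefixes below are illustrative placeholders, NOT the study's roadmap.
ROADMAP: Mapping[str, Tuple[str, ...]] = {
    "hemoglobin_a1c": ("R73", "E11"),
}


def normalize(code: str) -> str:
    """Strip whitespace and dots and upper-case, so 'r73.03' matches 'R7303'."""
    return code.strip().replace(".", "").upper()


def recover_missing(
    component: str,
    observed: Optional[str],
    diagnoses: Iterable[str],
    roadmap: Mapping[str, Tuple[str, ...]] = ROADMAP,
) -> Optional[str]:
    """Return the observed status, or infer 'unhealthy' from an anchor diagnosis.

    If the value is missing and no anchor diagnosis is present, the value
    stays missing (None); the sketch never infers 'healthy'.
    """
    if observed is not None:
        return observed  # nothing to recover
    anchors = tuple(normalize(a) for a in roadmap.get(component, ()))
    if anchors and any(normalize(d).startswith(anchors) for d in diagnoses):
        return "unhealthy"
    return None
```

For example, `recover_missing("hemoglobin_a1c", None, ["R73.03"])` returns `"unhealthy"`, while a patient whose only code is `I10` keeps a missing value. Matching on prefixes rather than exact codes is a design choice of this sketch so that one roadmap entry can cover a family of related codes.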
Problem

Research questions and friction points this paper is trying to address.

Improving missing data recovery in EHR data with LLM-enhanced algorithms
Automating expert chart reviews to handle EHR data incompleteness
Scaling clinically-driven data imputation with large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using large language models to expand clinical diagnosis roadmaps
Applying ICD-10 code algorithms to recover missing EHR data
Combining LLM suggestions with clinical expertise for validation
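
The combination of LLM suggestions with clinician validation listed above can be sketched as a simple filtering step. This is a hedged illustration only: the function `expand_roadmap`, the data shapes, and the example codes are assumptions made for this sketch, and the source does not describe how approvals were actually recorded.

```python
from typing import Dict, Iterable, Mapping, Set, Tuple


def expand_roadmap(
    roadmap: Mapping[str, Iterable[str]],
    llm_suggestions: Mapping[str, Iterable[str]],
    approved: Set[Tuple[str, str]],
) -> Dict[str, Set[str]]:
    """Add LLM-suggested anchor codes to a roadmap only if a clinician approved them.

    `approved` holds hypothetical (component, code) pairs signed off by clinicians.
    The input roadmap is copied, not modified in place.
    """
    expanded: Dict[str, Set[str]] = {k: set(v) for k, v in roadmap.items()}
    for component, codes in llm_suggestions.items():
        for code in codes:
            if (component, code) in approved:
                expanded.setdefault(component, set()).add(code)
    return expanded


# Illustrative values only; these are not the study's roadmap entries.
base = {"hemoglobin_a1c": {"R73"}}
suggested = {"hemoglobin_a1c": ["E11", "Z00"]}
signed_off = {("hemoglobin_a1c", "E11")}
final_roadmap = expand_roadmap(base, suggested, signed_off)
```

Here the suggestion without clinician approval is dropped, so the LLM can widen the candidate list without ever changing the roadmap on its own.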
Authors

Sarah C. Lotspeich
Wake Forest University
Biostatistics, Epidemiology, Global Health, Public Health

Abbey Collins
Department of Psychology, North Carolina State University, Raleigh, NC 27607

Brian J. Wells
Department of Biostatistics and Data Science, Wake Forest University School of Medicine, Winston-Salem, NC 27157

Ashish K. Khanna
Department of Anesthesiology, Division of Critical Care Medicine, Wake Forest University School of Medicine, Winston-Salem, NC 27157; Outcomes Research Consortium, Houston, TX 77030

Joseph Rigdon
Department of Biostatistics and Data Science, Wake Forest University School of Medicine, Winston-Salem, NC 27157

Lucy D'Agostino McGowan
Wake Forest University
Statistics, Biostatistics, Analytic Design Theory, Causal Inference, Statistical Communication