Enhancing LLMs with Smart Preprocessing for EHR Analysis

šŸ“… 2024-12-03
šŸ“ˆ Citations: 0
✨ Influential: 0
šŸ¤– AI Summary
To address the challenge of adapting large language models (LLMs) to electronic health record (EHR) analysis under stringent privacy constraints and limited computational resources, this paper proposes a lightweight, on-device LLM framework. Methodologically, it introduces a regular-expression-based pre-filtering mechanism combined with retrieval-augmented generation (RAG) to suppress noise in lengthy, unstructured EHR texts. Together with zero-/few-shot learning, model compression, and GPU-free deployment, the framework ensures end-to-end privacy preservation and efficient inference. Evaluated on MIMIC-IV and other clinical datasets, it improves accuracy by 23.5% on tasks including diagnosis extraction and critical biomarker identification, outperforming comparably sized fine-tuned models, and enables real-time inference on CPU-only servers. Key contributions include: (1) the first EHR-specific lightweight on-device deployment paradigm, and (2) a privacy-aware, computationally efficient preprocessing–retrieval–generation co-design architecture that jointly optimizes privacy, latency, and task performance.

šŸ“ Abstract
Large Language Models (LLMs) have demonstrated remarkable proficiency in natural language processing; however, their application in sensitive domains such as healthcare, especially in processing Electronic Health Records (EHRs), is constrained by limited computational resources and privacy concerns. This paper introduces a compact LLM framework optimized for local deployment in environments with stringent privacy requirements and restricted access to high-performance GPUs. Our approach leverages simple yet powerful preprocessing techniques, including regular expressions (regex) and Retrieval-Augmented Generation (RAG), to extract and highlight critical information from clinical notes. By pre-filtering long, unstructured text, we enhance the performance of smaller LLMs on EHR-related tasks. Our framework is evaluated using zero-shot and few-shot learning paradigms on both private and publicly available datasets (MIMIC-IV), with additional comparisons against fine-tuned LLMs on MIMIC-IV. Experimental results demonstrate that our preprocessing strategy substantially improves the performance of smaller LLMs, making them well-suited for privacy-sensitive and resource-constrained applications. This study offers valuable insights into optimizing LLM performance for local, secure, and efficient healthcare applications. It provides practical guidance for real-world deployment of LLMs while tackling challenges related to privacy, computational feasibility, and clinical applicability.
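The paper does not publish its actual regex patterns, but the regex pre-filtering idea described in the abstract can be sketched as follows. The section headers (`SECTION_PATTERNS`) and the five-line context window are illustrative assumptions, not the authors' implementation; the point is that matching a handful of clinically salient headers and keeping only nearby lines shrinks a long note before it reaches a small LLM.

```python
import re

# Hypothetical section headers often seen in discharge summaries; the
# paper's real patterns are not given, so these are illustrative only.
SECTION_PATTERNS = {
    "diagnosis": re.compile(r"(?i)^(discharge diagnosis|final diagnosis)\b"),
    "labs": re.compile(r"(?i)^(pertinent results|labs?)\b"),
}

def prefilter_note(note: str, context_lines: int = 5, max_chars: int = 2000) -> str:
    """Keep only lines near matched section headers to shrink the LLM prompt."""
    lines = note.splitlines()
    keep = []
    for i, line in enumerate(lines):
        if any(p.match(line.strip()) for p in SECTION_PATTERNS.values()):
            # Keep the header plus a few following lines of context.
            keep.extend(lines[i:i + context_lines])
    filtered = "\n".join(keep) if keep else note  # fall back to the full note
    return filtered[:max_chars]
```

A note containing a long history-of-present-illness section followed by a "Discharge Diagnosis:" header would be reduced to the diagnosis block, so a compact model sees only the relevant span.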
Problem

Research questions and friction points this paper is trying to address.

Optimizing LLMs for EHR analysis with limited computational resources
Enhancing privacy-sensitive healthcare applications using compact LLMs
Improving clinical note processing via smart preprocessing techniques
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compact LLM framework for local deployment
Regex and RAG for preprocessing EHRs
Zero-shot and few-shot learning evaluation
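The RAG component listed above can likewise be sketched with a minimal lexical retriever: split the EHR into chunks, score each chunk against the query, and pass only the top-k chunks to the LLM. This bag-of-words cosine scorer is an assumption for illustration; the paper does not specify its retriever.

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query, for use as LLM context."""
    qv = Counter(query.lower().split())
    ranked = sorted(chunks, key=lambda c: cosine(qv, Counter(c.lower().split())), reverse=True)
    return ranked[:k]
```

In practice the retrieved chunks would be concatenated into the prompt of the compact LLM, so irrelevant portions of a long record never consume context budget.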
Authors
Yixiang Qu
Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Yifan Dai
Hunan University
Shilin Yu
Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Pradham Tanikella
Department of Bioinformatics and Computational Biology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Travis P. Schrank
Otolaryngology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Trevor Hackman
Otolaryngology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Didong Li
Assistant Professor, Department of Biostatistics, Gillings School of Global Public Health, UNC
Di Wu
Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA