🤖 AI Summary
This study addresses the challenge of identifying and prioritizing medical terms in patient-readable electronic health record (EHR) notes under low-resource conditions. Method: We propose a systematic comparative framework that unifies the evaluation of prompt engineering (structured and few-shot prompting), LoRA fine-tuning, and ChatGPT-driven data augmentation for medical term extraction, and integrates learning-to-rank for importance scoring, validated via five-fold cross-validation. Contribution/Results: Data augmentation substantially boosts open-source model performance: Mistral-7B achieves an MRR of 0.746 after augmentation, surpassing GPT-4 Turbo, while GPT-4 Turbo attains the highest F1 score (0.433). Critically, the strategies that optimize F1 and MRR are decoupled, indicating that term identification and term ranking require distinct modeling approaches. Our work establishes a lightweight, interpretable paradigm for patient-facing EHR understanding.
📝 Abstract
Objective: OpenNotes enables patients to access EHR notes, but medical jargon can hinder comprehension. To improve understanding, we evaluated closed- and open-source LLMs for extracting and prioritizing key medical terms using prompting, fine-tuning, and data augmentation. Materials and Methods: We assessed LLMs on 106 expert-annotated EHR notes, experimenting with (i) general vs. structured prompts, (ii) zero-shot vs. few-shot prompting, (iii) fine-tuning, and (iv) data augmentation. To enhance open-source models in low-resource settings, we used ChatGPT for data augmentation and applied ranking techniques. We incrementally increased the augmented dataset size (10 to 10,000 notes) and conducted 5-fold cross-validation, reporting F1 score and Mean Reciprocal Rank (MRR). Results and Discussion: Fine-tuning and data augmentation improved performance over the other strategies. GPT-4 Turbo achieved the highest F1 (0.433), while Mistral-7B with data augmentation achieved the highest MRR (0.746). Open-source models, when fine-tuned or augmented, outperformed closed-source models. Notably, the best F1 and MRR scores did not always align. Few-shot prompting outperformed zero-shot in vanilla models, and preferences between general and structured prompts varied across models. Fine-tuning improved zero-shot performance but sometimes degraded few-shot performance. Data augmentation performed comparably to or better than the other methods. Conclusion: Our evaluation highlights the effectiveness of prompting, fine-tuning, and data augmentation in improving model performance for medical jargon extraction in low-resource scenarios.
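The two metrics reported above measure different things: F1 scores set overlap between extracted and gold-standard terms, while MRR rewards placing an expert-annotated term near the top of the model's ranked list. A minimal sketch of MRR (function name and input shapes are our own illustration, not the paper's implementation):

```python
def mean_reciprocal_rank(ranked_predictions, gold_sets):
    """MRR over notes: average of 1/rank of the first gold term
    found in each ranked prediction list (0 if none appears)."""
    total = 0.0
    for ranked, gold in zip(ranked_predictions, gold_sets):
        for rank, term in enumerate(ranked, start=1):
            if term in gold:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_predictions)


# Toy example: first note's gold term is at rank 2, second at rank 1.
preds = [["anemia", "dyspnea"], ["sepsis", "ileus"]]
golds = [{"dyspnea"}, {"sepsis"}]
print(mean_reciprocal_rank(preds, golds))  # (0.5 + 1.0) / 2 = 0.75
```

Because MRR depends only on where the first correct term lands, a model can score well on MRR while missing many terms (low recall, hence low F1), which is one way the decoupling of best-F1 and best-MRR strategies can arise.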