🤖 AI Summary
This study addresses the challenge of identifying and prioritizing medical terms in patient-readable electronic health record (EHR) notes under low-resource conditions. Method: We propose a systematic comparative framework that unifies the evaluation of prompt engineering (structured and few-shot prompting), LoRA fine-tuning, and ChatGPT-driven data augmentation for medical term extraction, and integrates learning-to-rank for importance scoring, validated via five-fold cross-validation. Contribution/Results: Data augmentation substantially boosts open-source model performance: Mistral-7B achieves an MRR of 0.746 after augmentation, surpassing GPT-4 Turbo, while GPT-4 Turbo attains the highest F1 score (0.433). Critically, the strategies that optimize F1 and MRR are decoupled, indicating that term identification and term ranking require distinct modeling approaches. Our work establishes a lightweight, interpretable paradigm for patient-facing EHR understanding.
📝 Abstract
Objective: OpenNotes enables patients to access EHR notes, but medical jargon can hinder comprehension. To improve understanding, we evaluated closed- and open-source LLMs for extracting and prioritizing key medical terms using prompting, fine-tuning, and data augmentation. Materials and Methods: We assessed LLMs on 106 expert-annotated EHR notes, experimenting with (i) general vs. structured prompts, (ii) zero-shot vs. few-shot prompting, (iii) fine-tuning, and (iv) data augmentation. To enhance open-source models in low-resource settings, we used ChatGPT for data augmentation and applied ranking techniques. We incrementally increased the augmented dataset size (10 to 10,000 notes) and conducted 5-fold cross-validation, reporting F1 score and Mean Reciprocal Rank (MRR). Results and Discussion: Fine-tuning and data augmentation improved performance over the other strategies. GPT-4 Turbo achieved the highest F1 (0.433), while Mistral-7B with data augmentation achieved the highest MRR (0.746). Open-source models, when fine-tuned or augmented, outperformed closed-source models. Notably, the best F1 and MRR scores did not always align. Few-shot prompting outperformed zero-shot in vanilla models, and preferences between general and structured prompts varied across models. Fine-tuning improved zero-shot performance but sometimes degraded few-shot performance. Data augmentation performed comparably to or better than the other methods. Conclusion: Our evaluation highlights the effectiveness of prompting, fine-tuning, and data augmentation in improving model performance for medical jargon extraction in low-resource scenarios.
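The two metrics reported above measure different things: F1 scores set overlap between extracted and gold-standard terms, while MRR rewards placing an expert-annotated term near the top of the model's ranked list. A minimal sketch of MRR (function name and input shapes are our own illustration, not the paper's implementation):

```python
def mean_reciprocal_rank(ranked_predictions, gold_sets):
    """MRR over notes: average of 1/rank of the first gold term
    found in each ranked prediction list (0 if none appears)."""
    total = 0.0
    for ranked, gold in zip(ranked_predictions, gold_sets):
        for rank, term in enumerate(ranked, start=1):
            if term in gold:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_predictions)


# Toy example: first note's gold term is at rank 2, second at rank 1.
preds = [["anemia", "dyspnea"], ["sepsis", "ileus"]]
golds = [{"dyspnea"}, {"sepsis"}]
print(mean_reciprocal_rank(preds, golds))  # (0.5 + 1.0) / 2 = 0.75
```

Because MRR depends only on where the first correct term lands, a model can score well on MRR while missing many terms (low recall, hence low F1), which is one way the decoupling of best-F1 and best-MRR strategies can arise.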