🤖 AI Summary
This study addresses the challenges of clinical information extraction from medical reports in resource-constrained settings—namely, unstructured text, domain-specific linguistic complexity, opacity of proprietary models, and data privacy risks—by proposing LLM-Extractinator, an open-source framework. Methodologically, it conducts a systematic zero-shot evaluation of open-weight large language models—including Phi-4-14B, Qwen-2.5-14B, DeepSeek-R1-14B, and Llama-3.3-70B—on Dutch clinical texts, processing reports in their original language to avoid translation-induced degradation. Results show that 14B-parameter models (Phi-4, Qwen-2.5, DeepSeek-R1) achieve competitive performance, while the 70B model yields only slightly higher scores at substantially greater computational cost. The core contribution is the empirical validation of lightweight, open-source LLMs for real-world clinical NLP tasks, demonstrating their efficacy, feasibility, and privacy-preserving potential. The framework and evaluation setup are publicly released to foster reproducible, privacy-aware clinical AI deployment.
📝 Abstract
Medical reports contain rich clinical information but are often unstructured and written in domain-specific language, posing challenges for information extraction. While proprietary large language models (LLMs) have shown promise in clinical natural language processing, their lack of transparency and data privacy concerns limit their utility in healthcare. This study therefore evaluates nine open-source generative LLMs on the DRAGON benchmark, which includes 28 clinical information extraction tasks in Dutch. We developed `llm_extractinator`, a publicly available framework for information extraction using open-source generative LLMs, and used it to assess model performance in a zero-shot setting. Several 14-billion-parameter models—Phi-4-14B, Qwen-2.5-14B, and DeepSeek-R1-14B—achieved competitive results, while the larger Llama-3.3-70B model achieved slightly higher performance at greater computational cost. Translation to English prior to inference consistently degraded performance, highlighting the need for native-language processing. These findings demonstrate that open-source LLMs, when used with our framework, offer effective, scalable, and privacy-conscious solutions for clinical information extraction in low-resource settings.
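The zero-shot, native-language pipeline described above can be sketched as follows. This is a minimal illustration, not the actual `llm_extractinator` API: the function names `build_zero_shot_prompt` and `parse_model_output` are hypothetical, and the actual model call (e.g. to a locally hosted open-weight LLM) is left as a placeholder.

```python
import json

def build_zero_shot_prompt(task_description: str, report_text: str) -> str:
    """Compose a zero-shot extraction prompt in the report's own language.

    No few-shot examples are included: the model must rely on the task
    description alone, and the report is passed untranslated, since
    translating to English before inference was found to degrade results.
    """
    return (
        f"{task_description}\n\n"
        # Dutch instruction: "Return the answer as a JSON object."
        "Geef het antwoord als een JSON-object.\n\n"
        f"Rapport:\n{report_text}\n\n"
        "Antwoord:"
    )

def parse_model_output(raw: str) -> dict:
    """Pull the first JSON object out of a free-form model completion."""
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    return json.loads(raw[start:end + 1])

# In practice, the prompt would be sent to a locally hosted open-weight
# model (a privacy-preserving setup, since data never leaves the site),
# and the completion fed to parse_model_output for structured results.
```

Keeping both the prompt and the report in Dutch reflects the paper's finding that native-language processing outperforms translate-then-extract pipelines.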