🤖 AI Summary
Clinical narrative reports, such as pathology and radiology notes, pose significant challenges for structured data extraction, particularly in multilingual, multi-disease, and cross-institutional settings.
Method: We systematically evaluated 15 open-weight large language models (LLMs) on six disease domains using six prompting strategies (zero-shot, one-shot, few-shot, chain-of-thought, self-consistency, and prompt graph) in real-world multinational clinical settings. Evaluation employed macro-F1 scores, consensus-based rank aggregation, and linear mixed-effects modeling to account for the hierarchical data structure.
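As a rough illustration of the evaluation pipeline described above (not the authors' code), the sketch below computes per-task macro-F1 with scikit-learn and derives a simple consensus ranking by averaging each model's rank across tasks; the model names, task labels, and data are hypothetical.

```python
import pandas as pd
from sklearn.metrics import f1_score

# Hypothetical long-format predictions: one row per (model, task, extracted variable).
preds = pd.DataFrame({
    "model":  ["model-a", "model-a", "model-b", "model-b"] * 2,
    "task":   ["melanoma"] * 4 + ["sarcoma"] * 4,
    "y_true": ["T2", "T3", "T2", "T3", "low", "high", "low", "high"],
    "y_pred": ["T2", "T2", "T2", "T3", "low", "high", "high", "high"],
})

# Macro-F1 per (model, task): unweighted mean of per-class F1 scores,
# so rare classes count as much as frequent ones.
rows = []
for (model, task), g in preds.groupby(["model", "task"]):
    rows.append({
        "model": model,
        "task": task,
        "macro_f1": f1_score(g["y_true"], g["y_pred"],
                             average="macro", zero_division=0),
    })
scores = pd.DataFrame(rows)

# Consensus ranking: rank models within each task (1 = best), then average the
# ranks across tasks so no single task dominates the overall ordering.
scores["rank"] = scores.groupby("task")["macro_f1"].rank(ascending=False)
consensus = scores.groupby("model")["rank"].mean().sort_values()
print(consensus)
```

A hierarchical variance analysis along the lines of the paper's linear mixed-effects models could then be fit on such long-format scores, for instance with statsmodels' `mixedlm`.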
Contribution/Results: Medium- and small-scale general-purpose LLMs achieved performance comparable to large models, while prompt graph and few-shot prompting yielded the most substantial gains (roughly 13%). Task-specific characteristics outweighed parameter count in predictive impact. The top-ranked models attained macro-F1 scores approaching inter-annotator agreement, demonstrating strong cross-lingual, cross-disease, and cross-institutional robustness and scalability, and establishing an efficient, low-cost, open-weight solution for clinical text structuring.
📝 Abstract
Large language models (LLMs) are increasingly used to extract structured information from free-text clinical records, but prior work often focuses on single tasks, limited models, and English-language reports. We evaluated 15 open-weight LLMs on pathology and radiology reports across six use cases (colorectal liver metastases, liver tumours, neurodegenerative diseases, soft-tissue tumours, melanomas, and sarcomas) at three institutes in the Netherlands, the UK, and the Czech Republic. Models included general-purpose and medical-specialised LLMs of various sizes, and six prompting strategies were compared: zero-shot, one-shot, few-shot, chain-of-thought, self-consistency, and prompt graph. Performance was assessed using task-appropriate metrics, with consensus rank aggregation and linear mixed-effects models quantifying variance. Top-ranked models achieved macro-average scores close to inter-rater agreement across tasks. Small-to-medium general-purpose models performed comparably to large models, while tiny and specialised models performed worse. Prompt graph and few-shot prompting improved performance by ~13%. Task-specific factors, including variable complexity and annotation variability, influenced results more than model size or prompting strategy. These findings show that open-weight LLMs can extract structured data from clinical reports across diseases, languages, and institutions, offering a scalable approach for clinical data curation.
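For a concrete sense of what a few-shot prompting setup for this kind of extraction can look like, here is a minimal, hypothetical sketch; the field schema, example report, and wording are invented and are not taken from the paper.

```python
# Hypothetical sketch of a few-shot structured-extraction prompt for a pathology
# report. The schema, example report, and labels are invented for illustration;
# how the messages are sent to a locally hosted open-weight LLM is left open.
import json

SCHEMA = {
    "tumour_type": "string",
    "largest_lesion_mm": "integer or null",
    "resection_margin_involved": "true/false/unknown",
}

FEW_SHOT_EXAMPLES = [
    {
        "report": "Wide excision, left thigh: high-grade soft-tissue sarcoma, "
                  "largest lesion 42 mm, margins free of tumour.",
        "output": {
            "tumour_type": "soft-tissue sarcoma",
            "largest_lesion_mm": 42,
            "resection_margin_involved": "false",
        },
    },
]

def build_messages(report_text: str) -> list[dict]:
    """Assemble a chat-style few-shot prompt asking for JSON that matches SCHEMA."""
    messages = [{
        "role": "system",
        "content": "Extract the following fields from the pathology report and "
                   "answer with JSON only:\n" + json.dumps(SCHEMA, indent=2),
    }]
    # Each worked example is added as a user/assistant turn pair before the query.
    for ex in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": ex["report"]})
        messages.append({"role": "assistant", "content": json.dumps(ex["output"])})
    messages.append({"role": "user", "content": report_text})
    return messages
```

The resulting `messages` list can be passed to any OpenAI-compatible chat endpoint of the kind exposed by common open-weight serving stacks; the zero-shot and one-shot variants differ only in how many worked examples are included.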