🤖 AI Summary
Clinical narrative reports, such as pathology and radiology notes, pose significant challenges for structured data extraction, particularly in multilingual, multi-disease, and cross-institutional settings.
Method: We systematically evaluated 15 open-weight large language models (LLMs) on six disease domains using six prompting strategies (zero-shot, one-shot, few-shot, chain-of-thought, self-consistency, and prompt graph) in real-world multinational clinical settings. Evaluation employed macro-F1 scores, consensus-based rank aggregation, and linear mixed-effects modeling to account for the hierarchical data structure.
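As a rough illustration of the evaluation pipeline described above (not the authors' code), the sketch below computes per-task macro-F1 with scikit-learn and derives a simple consensus ranking by averaging each model's rank across tasks; the model names, task labels, and data are hypothetical.

```python
import pandas as pd
from sklearn.metrics import f1_score

# Hypothetical long-format predictions: one row per (model, task, extracted variable).
preds = pd.DataFrame({
    "model":  ["model-a", "model-a", "model-b", "model-b"] * 2,
    "task":   ["melanoma"] * 4 + ["sarcoma"] * 4,
    "y_true": ["T2", "T3", "T2", "T3", "low", "high", "low", "high"],
    "y_pred": ["T2", "T2", "T2", "T3", "low", "high", "high", "high"],
})

# Macro-F1 per (model, task): unweighted mean of per-class F1 scores,
# so rare classes count as much as frequent ones.
rows = []
for (model, task), g in preds.groupby(["model", "task"]):
    rows.append({
        "model": model,
        "task": task,
        "macro_f1": f1_score(g["y_true"], g["y_pred"],
                             average="macro", zero_division=0),
    })
scores = pd.DataFrame(rows)

# Consensus ranking: rank models within each task (1 = best), then average the
# ranks across tasks so no single task dominates the overall ordering.
scores["rank"] = scores.groupby("task")["macro_f1"].rank(ascending=False)
consensus = scores.groupby("model")["rank"].mean().sort_values()
print(consensus)
```

A hierarchical variance analysis along the lines of the paper's linear mixed-effects models could then be fit on such long-format scores, for instance with statsmodels' `mixedlm`.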
Contribution/Results: Medium- and small-scale general-purpose LLMs achieved performance comparable to large models, while prompt graph and few-shot prompting yielded the most substantial gains (roughly 13%). Task-specific characteristics outweighed parameter count in predictive impact. The top-ranked models attained macro-F1 scores approaching inter-annotator agreement, demonstrating strong cross-lingual, cross-disease, and cross-institutional robustness and scalability, and establishing an efficient, low-cost, open-weight solution for clinical text structuring.
📝 Abstract
Large language models (LLMs) are increasingly used to extract structured information from free-text clinical records, but prior work often focuses on single tasks, limited models, and English-language reports. We evaluated 15 open-weight LLMs on pathology and radiology reports across six use cases (colorectal liver metastases, liver tumours, neurodegenerative diseases, soft-tissue tumours, melanomas, and sarcomas) at three institutes in the Netherlands, the UK, and the Czech Republic. Models included general-purpose and medical-specialised LLMs of various sizes, and six prompting strategies were compared: zero-shot, one-shot, few-shot, chain-of-thought, self-consistency, and prompt graph. Performance was assessed using task-appropriate metrics, with consensus rank aggregation and linear mixed-effects models quantifying variance. Top-ranked models achieved macro-average scores close to inter-rater agreement across tasks. Small-to-medium general-purpose models performed comparably to large models, while tiny and specialised models performed worse. Prompt graph and few-shot prompting improved performance by ~13%. Task-specific factors, including variable complexity and annotation variability, influenced results more than model size or prompting strategy. These findings show that open-weight LLMs can extract structured data from clinical reports across diseases, languages, and institutions, offering a scalable approach for clinical data curation.
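For a concrete sense of what a few-shot prompting setup for this kind of extraction can look like, here is a minimal, hypothetical sketch; the field schema, example report, and wording are invented and are not taken from the paper.

```python
# Hypothetical sketch of a few-shot structured-extraction prompt for a pathology
# report. The schema, example report, and labels are invented for illustration;
# how the messages are sent to a locally hosted open-weight LLM is left open.
import json

SCHEMA = {
    "tumour_type": "string",
    "largest_lesion_mm": "integer or null",
    "resection_margin_involved": "true/false/unknown",
}

FEW_SHOT_EXAMPLES = [
    {
        "report": "Wide excision, left thigh: high-grade soft-tissue sarcoma, "
                  "largest lesion 42 mm, margins free of tumour.",
        "output": {
            "tumour_type": "soft-tissue sarcoma",
            "largest_lesion_mm": 42,
            "resection_margin_involved": "false",
        },
    },
]

def build_messages(report_text: str) -> list[dict]:
    """Assemble a chat-style few-shot prompt asking for JSON that matches SCHEMA."""
    messages = [{
        "role": "system",
        "content": "Extract the following fields from the pathology report and "
                   "answer with JSON only:\n" + json.dumps(SCHEMA, indent=2),
    }]
    # Each worked example is added as a user/assistant turn pair before the query.
    for ex in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": ex["report"]})
        messages.append({"role": "assistant", "content": json.dumps(ex["output"])})
    messages.append({"role": "user", "content": report_text})
    return messages
```

The resulting `messages` list can be passed to any OpenAI-compatible chat endpoint of the kind exposed by common open-weight serving stacks; the zero-shot and one-shot variants differ only in how many worked examples are included.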