🤖 AI Summary
The practical utility of large language models (LLMs) in non-generative clinical prediction tasks remains unclear: their performance is potentially underestimated relative to specialized models (e.g., BERT, traditional ML), and the absence of standardized benchmarks creates risks of misuse.
Method: We systematically evaluate 9 GPT-family, 5 BERT-family, and 7 traditional ML models across two clinical prediction settings—unstructured clinical text and structured electronic health records (EHRs)—under zero-shot, few-shot, and fine-tuned regimes.
Contribution/Results: We provide first empirical evidence that state-of-the-art LLMs in zero-shot settings outperform fine-tuned BERT by +8.2% accuracy; that open-weight LLMs (e.g., DeepSeek-R1/V3) match or exceed closed-source counterparts (e.g., GPT-4o); and that LLMs achieve a +5.7% average AUC gain on few-shot EHR tasks. We propose a data-efficient, prompt-driven paradigm for clinical prediction and demonstrate the viability of LLMs as cost-effective clinical AI tools.
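To make the "prompt-driven paradigm" concrete, below is a minimal illustrative sketch of what zero-shot clinical prediction from an unstructured note could look like. It is not the paper's protocol: the model name, prompt wording, label format, and the `predict_mortality_zero_shot` helper are assumptions for illustration only.

```python
# Illustrative sketch only: a zero-shot, prompt-driven clinical prediction call.
# Model name, prompt wording, and label set are assumptions, not the paper's protocol.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def predict_mortality_zero_shot(clinical_note: str) -> str:
    """Ask an LLM for a binary in-hospital mortality prediction from a clinical note."""
    prompt = (
        "You are a clinical prediction assistant.\n"
        "Based on the clinical note below, will the patient die during this "
        "hospital admission? Answer with exactly one word: YES or NO.\n\n"
        f"Clinical note:\n{clinical_note}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",          # any chat model; an open-weight LLM endpoint could be swapped in
        temperature=0,           # deterministic output for evaluation
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

# Example usage with a hypothetical note:
# print(predict_mortality_zero_shot("72M admitted with septic shock, on vasopressors..."))
```

The same pattern extends to few-shot use by prepending labeled example notes to the prompt, which is the data-efficient setting the summary contrasts with fully supervised fine-tuning.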
📝 Abstract
Large Language Models (LLMs) are increasingly deployed in medicine. However, their utility in non-generative clinical prediction, often presumed inferior to that of specialized models, remains under-evaluated, leading to ongoing debate within the field and potential for misuse, misunderstanding, or over-reliance due to a lack of systematic benchmarking. Our ClinicRealm study addresses this by benchmarking 9 GPT-based LLMs, 5 BERT-based models, and 7 traditional methods on unstructured clinical notes and structured Electronic Health Records (EHRs). Key findings reveal a significant shift: for clinical note predictions, leading LLMs (e.g., DeepSeek R1/V3, GPT o3-mini-high) in zero-shot settings now decisively outperform fine-tuned BERT models. On structured EHRs, while specialized models excel with ample data, advanced LLMs (e.g., GPT-4o, DeepSeek R1/V3) show potent zero-shot capabilities, often surpassing conventional models in data-scarce settings. Notably, leading open-source LLMs can match or exceed their proprietary counterparts. These results establish modern LLMs as powerful non-generative clinical prediction tools, particularly for unstructured text, while also offering data-efficient options for structured data, and thus necessitate a re-evaluation of model selection strategies. These findings offer important insights for medical informaticists, AI developers, and clinical researchers, potentially prompting a reassessment of current assumptions and inspiring new approaches to applying LLMs in predictive healthcare.