Benchmarking LLMs for Predictive Applications in the Intensive Care Units

📅 2025-12-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Predicting shock in ICU patients remains challenging because the outcome is imbalanced: abnormal cases, defined by a Shock Index (SI) > 0.7, are the minority class. Method: This study benchmarks language models on ICU shock prediction using clinical text from 17,294 ICU stays in MIMIC-III, which were scored for length of stay > 24 hours and SI > 0.7 to yield 355 patients with normal SI and 87 with abnormal SI. It compares large language models (LLMs), namely GatorTron-Base (trained on clinical data), Llama 8B, and Mistral 7B, against smaller models (SLMs) such as BioBERT, DocBERT, BioClinicalBERT, Word2Vec, and Doc2Vec. Both focal loss and cross-entropy loss were used during fine-tuning to address class imbalance. Contribution/Results: GatorTron-Base achieves the highest weighted recall (80.5%), yet overall performance is comparable between LLMs and SLMs. The findings suggest that LLMs are not inherently superior to SLMs at predicting future clinical events, despite their strong performance on text-based tasks. The authors argue that future LLM training should prioritize predicting clinical trajectories over simpler tasks such as named entity recognition or phenotyping.
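The weighted recall reported above is the support-weighted mean of per-class recall. The sketch below shows how that metric is computed in plain Python. The function name `weighted_recall` and the label encoding (0 = normal SI, 1 = abnormal SI) are assumptions for illustration, and the class counts are borrowed from the abstract only as example supports; the paper's actual evaluation split is not given here.

```python
from collections import Counter


def weighted_recall(y_true, y_pred):
    """Support-weighted mean of per-class recall (hypothetical helper)."""
    support = Counter(y_true)  # number of true examples per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    n = len(y_true)
    # Each class contributes recall_c * (support_c / n).
    return sum((correct[c] / support[c]) * (support[c] / n) for c in support)


# Illustrative labels only: 0 = normal SI, 1 = abnormal SI.
y_true = [0] * 355 + [1] * 87
# Hypothetical classifier that predicts "normal" for every patient.
y_pred = [0] * 442
score = weighted_recall(y_true, y_pred)
print(f"weighted recall: {score:.3f}")
```

For single-label classification the weights cancel, so weighted recall equals overall accuracy. On imbalanced data it is therefore worth reading alongside per-class recall for the minority class.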

📝 Abstract

With the advent of LLMs, various tasks across the natural language processing domain have been transformed. However, their application in predictive tasks remains less studied. This study compares large language models, including GatorTron-Base (trained on clinical data), Llama 8B, and Mistral 7B, against models like BioBERT, DocBERT, BioClinicalBERT, Word2Vec, and Doc2Vec, setting benchmarks for predicting shock in critically ill patients. Timely prediction of shock can enable early interventions, thus improving patient outcomes. Text data from 17,294 ICU stays of patients in the MIMIC-III database were scored for length of stay > 24 hours and shock index (SI) > 0.7 to yield 355 and 87 patients with normal and abnormal SI, respectively. Both focal and cross-entropy losses were used during fine-tuning to address class imbalance. Our findings indicate that while GatorTron-Base achieved the highest weighted recall of 80.5%, the overall performance metrics were comparable between SLMs and LLMs. This suggests that LLMs are not inherently superior to SLMs in predicting future clinical events despite their strong performance on text-based tasks. To achieve meaningful clinical outcomes, future efforts in training LLMs should prioritize developing models capable of predicting clinical trajectories rather than focusing on simpler tasks such as named entity recognition or phenotyping.
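To make the labeling rule in the abstract concrete, here is a minimal sketch. Shock index is conventionally defined as heart rate divided by systolic blood pressure; the 0.7 threshold and the 24-hour length-of-stay cutoff come from the abstract. Everything else is an assumption for illustration: the function names, the use of a single pair of vital-sign readings per stay, and the convention of returning `None` for excluded stays. The abstract does not say which measurements or time window the authors used.

```python
def shock_index(heart_rate_bpm, systolic_bp_mmhg):
    """Shock index = heart rate / systolic blood pressure."""
    if systolic_bp_mmhg <= 0:
        raise ValueError("systolic blood pressure must be positive")
    return heart_rate_bpm / systolic_bp_mmhg


def label_stay(heart_rate_bpm, systolic_bp_mmhg, los_hours,
               si_threshold=0.7, min_los_hours=24):
    """Hypothetical labeling helper.

    Returns None if the stay does not exceed the minimum length of stay,
    1 if SI is above the threshold (abnormal), and 0 otherwise (normal).
    """
    if los_hours <= min_los_hours:
        return None  # excluded: stay is not longer than 24 hours
    si = shock_index(heart_rate_bpm, systolic_bp_mmhg)
    return int(si > si_threshold)


# Made-up vital signs, not data from the study.
print(label_stay(110, 90, los_hours=48))   # SI is about 1.22, so abnormal
print(label_stay(70, 120, los_hours=48))   # SI is about 0.58, so normal
print(label_stay(110, 90, los_hours=12))   # stay too short, so excluded
```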

Problem

Research questions and friction points this paper is trying to address.

Benchmarking LLMs for predicting shock in ICU patients

Comparing LLMs and SLMs on clinical predictive tasks

Addressing class imbalance in clinical event prediction

Innovation

Methods, ideas, or system contributions that make the work stand out.

Comparing LLMs and SLMs for clinical shock prediction

Using focal and cross-entropy loss to address class imbalance

Recommending that future LLM training target clinical trajectory prediction
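The two losses named above can be sketched for a single binary prediction as follows. This is an illustration of the standard definitions, not the authors' implementation: the abstract does not say whether the losses were combined or used in separate runs, and the `gamma` and `alpha` values here are assumed defaults rather than the paper's settings.

```python
import math

EPS = 1e-12  # avoids log(0) for extreme probabilities


def _p_true(p, y):
    """Probability the model assigns to the true class."""
    p_t = p if y == 1 else 1.0 - p
    return min(max(p_t, EPS), 1.0)


def cross_entropy(p, y):
    """Binary cross-entropy for one example: -log(p_t)."""
    return -math.log(_p_true(p, y))


def focal_loss(p, y, gamma=2.0, alpha=None):
    """Focal loss for one example: -alpha_t * (1 - p_t)**gamma * log(p_t).

    gamma down-weights well-classified examples; alpha optionally
    re-weights the positive class. Both values are assumptions here.
    """
    p_t = _p_true(p, y)
    weight = 1.0 if alpha is None else (alpha if y == 1 else 1.0 - alpha)
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy positive (p = 0.9) versus a hard positive (p = 0.1).
for p in (0.9, 0.1):
    print(p, round(cross_entropy(p, 1), 4), round(focal_loss(p, 1), 4))
```

With `gamma=0` and no `alpha`, focal loss reduces to cross-entropy. Larger `gamma` shrinks the contribution of confidently correct examples, which is why focal loss is often tried when the majority class dominates the training signal.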
