🤖 AI Summary
This study systematically compares large language models (LLMs) with classical machine learning (CML) methods for predicting mortality from high-dimensional tabular clinical data of COVID-19 patients, a task traditionally dominated by CML models.
Method: Structured electronic health record features are converted into natural language prompts; GPT-4 is applied in zero-shot classification, while Mistral-7B is fine-tuned via QLoRA. Performance is benchmarked against XGBoost and Random Forest.
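The tabular-to-text step can be sketched as follows. This is a minimal illustration of serializing one structured EHR row into a zero-shot prompt; the feature names and prompt wording are assumptions, not the authors' actual template:

```python
def record_to_prompt(record: dict) -> str:
    """Serialize one structured EHR row into a natural-language
    zero-shot classification prompt (illustrative template only)."""
    features = "; ".join(f"{name}: {value}" for name, value in record.items())
    return (
        "A COVID-19 patient has the following clinical features: "
        f"{features}. "
        "Will this patient survive? Answer 'yes' or 'no'."
    )

# Hypothetical example row; real feature names come from the EHR dataset.
patient = {"age": 67, "sex": "male", "oxygen saturation": "88%", "diabetes": "yes"}
prompt = record_to_prompt(patient)
```

The resulting string is what a zero-shot LLM such as GPT-4 would receive in place of the raw feature vector that XGBoost or Random Forest consumes directly.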
Contribution/Results: CML models achieve strong F1 scores of 0.87 (internal validation) and 0.83 (external validation). GPT-4 zero-shot performs poorly (F1 = 0.43), whereas QLoRA-finetuned Mistral-7B attains an F1 of 0.74 with a recall of 0.79 and stable external generalization. Crucially, this work demonstrates that lightweight LLMs, when adapted via parameter-efficient fine-tuning, can approach (though not yet match) the predictive performance of established CML models on structured healthcare prediction tasks. It points toward a practical paradigm for leveraging LLMs in tabular biomedical data analysis, bridging the gap between natural language processing and clinical informatics.
📝 Abstract
Background: This study aimed to evaluate and compare the performance of classical machine learning models (CMLs) and large language models (LLMs) in predicting COVID-19-associated mortality from a high-dimensional tabular dataset.

Materials and Methods: We analyzed data from 9,134 COVID-19 patients collected across four hospitals. Seven CML models, including XGBoost and random forest (RF), were trained and evaluated. The structured data were converted into text for zero-shot classification by eight LLMs, including GPT-4 and Mistral-7B. Additionally, Mistral-7B was fine-tuned using the QLoRA approach to enhance its predictive capabilities.

Results: Among the CML models, XGBoost and RF achieved the highest accuracy, with F1 scores of 0.87 for internal validation and 0.83 for external validation. In the LLM category, GPT-4 was the top performer, with an F1 score of 0.43. Fine-tuning Mistral-7B improved its recall from 1% to 79%, yielding an F1 score of 0.74 that remained stable during external validation.

Conclusion: While LLMs show moderate performance in zero-shot classification, fine-tuning can significantly enhance their effectiveness, potentially bringing them closer to CML performance. However, CMLs still outperform LLMs on high-dimensional tabular data tasks.
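As a quick sanity check on the reported numbers, F1 is the harmonic mean of precision and recall. The abstract reports recall (0.79) and F1 (0.74) but not precision; the sketch below back-solves the implied precision (roughly 0.70), which is an inference, not a figure stated in the paper:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rearranging F1 = 2PR / (P + R) gives P = F1 * R / (2R - F1).
recall = 0.79
f1 = 0.74
implied_precision = f1 * recall / (2 * recall - f1)
print(round(implied_precision, 2))  # -> 0.7
print(round(f1_score(implied_precision, recall), 2))  # -> 0.74
```

This consistency check also makes concrete why the un-finetuned model's 1% recall capped its F1 near zero regardless of precision.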