🤖 AI Summary
This study systematically compares large language models (LLMs) with classical machine learning (CML) methods for predicting mortality from high-dimensional tabular clinical data of COVID-19 patients, a task traditionally dominated by CML models.
Method: Structured electronic health record features are converted into natural language prompts; GPT-4 is applied in zero-shot classification, while Mistral-7B is fine-tuned via QLoRA. Performance is benchmarked against XGBoost and Random Forest.
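The tabular-to-text step can be sketched as follows. This is a minimal illustration of serializing one structured EHR row into a zero-shot prompt; the feature names and prompt wording are assumptions, not the authors' actual template:

```python
def record_to_prompt(record: dict) -> str:
    """Serialize one structured EHR row into a natural-language
    zero-shot classification prompt (illustrative template only)."""
    features = "; ".join(f"{name}: {value}" for name, value in record.items())
    return (
        "A COVID-19 patient has the following clinical features: "
        f"{features}. "
        "Will this patient survive? Answer 'yes' or 'no'."
    )

# Hypothetical example row; real feature names come from the EHR dataset.
patient = {"age": 67, "sex": "male", "oxygen saturation": "88%", "diabetes": "yes"}
prompt = record_to_prompt(patient)
```

The resulting string is what a zero-shot LLM such as GPT-4 would receive in place of the raw feature vector that XGBoost or Random Forest consumes directly.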
Contribution/Results: CML models achieve strong F1 scores of 0.87 (internal validation) and 0.83 (external validation). GPT-4 zero-shot performs poorly (F1 = 0.43), whereas QLoRA-finetuned Mistral-7B attains an F1 of 0.74 with a recall of 0.79 and stable external generalization. Crucially, this work demonstrates that lightweight LLMs, when adapted via parameter-efficient fine-tuning, can approach (though not yet match) the predictive performance of established CML models on structured healthcare prediction tasks. It points toward a practical paradigm for leveraging LLMs in tabular biomedical data analysis, bridging the gap between natural language processing and clinical informatics.
📝 Abstract
Background: This study aimed to evaluate and compare the performance of classical machine learning models (CMLs) and large language models (LLMs) in predicting COVID-19-associated mortality from a high-dimensional tabular dataset.

Materials and Methods: We analyzed data from 9,134 COVID-19 patients collected across four hospitals. Seven CML models, including XGBoost and random forest (RF), were trained and evaluated. The structured data were converted into text for zero-shot classification by eight LLMs, including GPT-4 and Mistral-7B. Additionally, Mistral-7B was fine-tuned using the QLoRA approach to enhance its predictive capabilities.

Results: Among the CML models, XGBoost and RF achieved the highest accuracy, with F1 scores of 0.87 for internal validation and 0.83 for external validation. In the LLM category, GPT-4 was the top performer, with an F1 score of 0.43. Fine-tuning Mistral-7B improved its recall from 1% to 79%, yielding an F1 score of 0.74 that remained stable during external validation.

Conclusion: While LLMs show moderate performance in zero-shot classification, fine-tuning can significantly enhance their effectiveness, potentially bringing them closer to CML performance. However, CMLs still outperform LLMs on high-dimensional tabular data tasks.
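As a quick sanity check on the reported numbers, F1 is the harmonic mean of precision and recall. The abstract reports recall (0.79) and F1 (0.74) but not precision; the sketch below back-solves the implied precision (roughly 0.70), which is an inference, not a figure stated in the paper:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rearranging F1 = 2PR / (P + R) gives P = F1 * R / (2R - F1).
recall = 0.79
f1 = 0.74
implied_precision = f1 * recall / (2 * recall - f1)
print(round(implied_precision, 2))  # -> 0.7
print(round(f1_score(implied_precision, recall), 2))  # -> 0.74
```

This consistency check also makes concrete why the un-finetuned model's 1% recall capped its F1 near zero regardless of precision.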