Contextual Phenotyping of Pediatric Sepsis Cohort Using Large Language Models

📅 2025-05-14
🤖 AI Summary
This study addresses the challenge of phenotyping pediatric sepsis patients in resource-limited settings, where high-dimensional, heterogeneous clinical data impede conventional clustering approaches and semantic interpretation. We propose a framework that integrates large language models (LLMs) into the clustering of mixed-type clinical data. By serializing patient records into text and adding a task-oriented prompt (a clustering objective), our method jointly encodes nutritional, clinical, and socioeconomic features to enable context-aware, interpretable subgroup identification. We evaluate embeddings from quantized LLaMA-3.1-8B, LoRA-finetuned DeepSeek-R1-Distill-Llama-8B, and Stella-En-400M-V5, each coupled with K-means, and benchmark them against a classical baseline of K-medoids on UMAP- and FAMD-reduced data. Stella-En-400M-V5 achieves the highest silhouette score (0.86), while LLaMA-3.1-8B with the clustering objective performs better at higher cluster counts, separating subgroups with distinct nutritional, clinical, and socioeconomic profiles. The LLM-based approaches outperform the classical baseline in clustering quality and yield more interpretable subgroups, supporting data-driven sepsis phenotyping in low-resource contexts.

📝 Abstract
Clustering patient subgroups is essential for personalized care and efficient resource use. Traditional clustering methods struggle with high-dimensional, heterogeneous healthcare data and lack contextual understanding. This study evaluates Large Language Model (LLM)-based clustering against classical methods using a pediatric sepsis dataset from a low-income country (LIC), containing 2,686 records with 28 numerical and 119 categorical variables. Patient records were serialized into text with and without a clustering objective. Embeddings were generated using quantized LLAMA 3.1 8B, DeepSeek-R1-Distill-Llama-8B with low-rank adaptation (LoRA), and Stella-En-400M-V5 models. K-means clustering was applied to these embeddings. Classical comparisons included K-Medoids clustering on UMAP- and FAMD-reduced mixed data. Silhouette scores and statistical tests evaluated cluster quality and distinctiveness. Stella-En-400M-V5 achieved the highest Silhouette Score (0.86). LLAMA 3.1 8B with the clustering objective performed better with a higher number of clusters, identifying subgroups with distinct nutritional, clinical, and socioeconomic profiles. LLM-based methods outperformed classical techniques by capturing richer context and prioritizing key features. These results highlight the potential of LLMs for contextual phenotyping and informed decision-making in resource-limited settings.
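The serialization step mentioned in the abstract (records turned into text "with and without a clustering objective") might look roughly like the sketch below. It is an illustration only, not the authors' code: the field names, the example values, the handling of missing values, and the wording of the objective sentence are all hypothetical, since the abstract does not give the actual template.

```python
from typing import Mapping, Optional, Union

Value = Union[int, float, str, None]

# Hypothetical objective text; the prompt actually used in the paper
# is not stated in the abstract.
CLUSTERING_OBJECTIVE = (
    "Task: represent this patient so that clinically similar "
    "pediatric sepsis patients are grouped together."
)


def serialize_record(record: Mapping[str, Value],
                     objective: Optional[str] = None) -> str:
    """Turn one mixed-type patient record into a single line of text.

    Numerical and categorical variables are written as "name: value"
    pairs. If an objective is given, it is prepended to the text.
    """
    parts = []
    for name, value in record.items():
        if value is None:
            continue  # assumption: missing values are simply skipped
        if isinstance(value, float):
            value = f"{value:g}"  # compact float formatting
        parts.append(f"{name.replace('_', ' ')}: {value}")
    body = "; ".join(parts) + "."
    return f"{objective} {body}" if objective else body


# Hypothetical record with made-up variable names and values.
example = {
    "age_months": 14,
    "muac_mm": 118.5,
    "hiv_status": "negative",
    "maternal_education": None,
}

plain = serialize_record(example)
with_objective = serialize_record(example, CLUSTERING_OBJECTIVE)
```

Each resulting string would then be passed to an embedding model, and the two variants (`plain` and `with_objective`) correspond to the "without" and "with" clustering-objective conditions compared in the study.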
Problem

Research questions and friction points this paper is trying to address.

Clustering pediatric sepsis patients for personalized care
Overcoming limitations of traditional clustering in healthcare data
Evaluating LLM-based methods in low-income country settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses LLM-based clustering for pediatric sepsis phenotyping
Generates embeddings with quantized LLAMA and DeepSeek models
Outperforms classical methods with higher Silhouette Scores
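For readers unfamiliar with the metric behind these comparisons, the silhouette score can be computed as in the sketch below. This is a plain-Python illustration under the assumption of Euclidean distance on the embedding vectors; the abstract does not state which distance the authors used, and the toy points are made up. For each point, `a` is the mean distance to the other members of its own cluster, `b` is the lowest mean distance to any other cluster, and the point's score is `(b - a) / max(a, b)`.

```python
import math
from collections import defaultdict
from typing import Sequence


def silhouette_score(points: Sequence[Sequence[float]],
                     labels: Sequence[int]) -> float:
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        # a: mean distance to the other points in the same cluster
        a = sum(math.dist(points[i], points[j])
                for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the points of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


# Toy 2-D "embeddings": two tight, well-separated groups.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(toy_points, [0, 0, 1, 1])  # labels match the groups
bad = silhouette_score(toy_points, [0, 1, 0, 1])   # labels cut across them
```

Values close to 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest points assigned to the wrong cluster. In the toy example, `good` is close to 1 and `bad` is negative.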
Aditya Nagori
Duke University
Computational Biomedicine, GenAI for Medicine, Intensive care unit, Data Science
Ayush Gautam
Indian Institute of Technology, Goa, India
Matthew O. Wiens
University of British Columbia
Epidemiology, Pediatric Sepsis, Prediction modelling, Post-discharge outcomes
Vuong Nguyen
University of Sydney
Biostatistics
Nathan Kenya Mugisha
Walimu, Kampala, Uganda
Jerome Kabakyenga
Mbarara University of Science and Technology
Public Health, Maternal Health, Newborn Health, Child Health
Niranjan Kissoon
Professor, University of British Columbia
Pediatric Medicine, Global Health
J. M. Ansermino
Institute for Global Health, BC Children’s Hospital and BC Women’s Hospital + Health Centre, Vancouver, BC, Canada; Department of Anesthesia, Pharmacology & Therapeutics, University of British Columbia, Vancouver, BC, Canada; BC Children’s Hospital Research Institute, BC Children’s Hospital, Vancouver, BC, Canada
Rishikesan Kamaleswaran
Duke University
Host-Response, Injury, Critical Care, Machine Learning, Artificial Intelligence