Multi-Label Classification with Generative AI Models in Healthcare: A Case Study of Suicidality and Risk Factors

📅 2025-07-22
🤖 AI Summary
This study addresses the limitations of binary classification in identifying suicidality-related factors (suicidal ideation, suicide attempt, exposure to suicide, and non-suicidal self-injury) in psychiatric electronic health records (EHRs). It proposes a novel generative multi-label classification framework for clinical text. The end-to-end pipeline evaluates fine-tuned GPT-3.5 and GPT-4.5 with guided prompting, and introduces label-set-level evaluation metrics and a multi-label confusion matrix to characterize model error patterns, such as the conflation of suicidal ideation with suicide attempt and the models' tendency toward cautious over-labeling. The fine-tuned GPT-3.5 achieves 0.94 partial-match accuracy and a 0.91 F1 score, while GPT-4.5 with guided prompting performs better across label sets, including rare ones, indicating more balanced and robust performance. The work demonstrates the feasibility of generative AI for modeling complex, co-occurring suicide risk factors in clinical NLP.

📝 Abstract
Suicide remains a pressing global health crisis, with over 720,000 deaths annually and millions more affected by suicide ideation (SI) and suicide attempts (SA). Early identification of suicidality-related factors (SrFs), including SI, SA, exposure to suicide (ES), and non-suicidal self-injury (NSSI), is critical for timely intervention. While prior studies have applied AI to detect SrFs in clinical notes, most treat suicidality as a binary classification task, overlooking the complexity of co-occurring risk factors. This study explores the use of generative large language models (LLMs), specifically GPT-3.5 and GPT-4.5, for multi-label classification (MLC) of SrFs from psychiatric electronic health records (EHRs). We present a novel end-to-end generative MLC pipeline and introduce advanced evaluation methods, including label-set-level metrics and a multi-label confusion matrix for error analysis. Fine-tuned GPT-3.5 achieved top performance with 0.94 partial-match accuracy and a 0.91 F1 score, while GPT-4.5 with guided prompting showed superior performance across label sets, including rare or minority label sets, indicating more balanced and robust performance. Our findings reveal systematic error patterns, such as the conflation of SI and SA, and highlight the models' tendency toward cautious over-labeling. This work not only demonstrates the feasibility of using generative AI for complex clinical classification tasks but also provides a blueprint for structuring unstructured EHR data to support large-scale clinical research and evidence-based medicine.
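To make the evaluation terms in the abstract concrete, the following is a minimal Python sketch of exact-match accuracy, partial-match accuracy, and per-label-set accuracy. The abstract does not define these metrics, so the definitions here are assumptions: a partial match is taken to mean that the predicted set shares at least one label with the gold set (or that both are empty). All function names and the toy data are hypothetical.

```python
from collections import defaultdict

# The four suicidality-related factors named in the abstract.
LABELS = ("SI", "SA", "ES", "NSSI")


def exact_match(gold, pred):
    """True when the predicted label set equals the gold label set."""
    return gold == pred


def partial_match(gold, pred):
    """Assumed definition: at least one shared label, or both sets empty."""
    return bool(gold & pred) or (not gold and not pred)


def accuracy(pairs, match):
    """Fraction of (gold, pred) pairs accepted by the given match function."""
    return sum(match(g, p) for g, p in pairs) / len(pairs)


def per_label_set_accuracy(pairs):
    """Exact-match accuracy grouped by gold label set, so rare sets stay visible."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for gold, pred in pairs:
        key = frozenset(gold)
        totals[key] += 1
        hits[key] += gold == pred
    return {key: hits[key] / totals[key] for key in totals}


# Toy (gold, predicted) pairs; not data from the study.
pairs = [
    ({"SI"}, {"SI"}),
    ({"SI", "SA"}, {"SI"}),
    ({"ES"}, {"NSSI"}),
    (set(), set()),
]

print(accuracy(pairs, exact_match))    # 0.5
print(accuracy(pairs, partial_match))  # 0.75
print(per_label_set_accuracy(pairs))
```

Grouping by gold label set is one way to see whether a model that scores well overall still fails on infrequent combinations, which is the kind of comparison the abstract draws between the two models.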
Problem

Research questions and friction points this paper is trying to address.

Classify multiple suicidality risk factors from EHRs
Improve early detection of co-occurring suicide-related conditions
Evaluate generative AI models for multi-label clinical tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative AI models for multi-label classification
Fine-tuned GPT-3.5 achieves high accuracy
Guided prompting enhances GPT-4.5 performance
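As an illustration of how generated text could be turned into label sets and tallied in a label-set-level confusion table, here is a minimal Python sketch. The output format, the parsing rule, and every name below are assumptions made for illustration; the page does not describe the authors' prompt format or matrix layout.

```python
import re
from collections import Counter

# The four suicidality-related factors named in the abstract.
LABELS = ("SI", "SA", "ES", "NSSI")


def parse_label_set(generated):
    """Extract known label abbreviations from free-text output (assumed format)."""
    tokens = re.findall(r"[A-Za-z]+", generated.upper())
    return frozenset(t for t in tokens if t in LABELS)


def label_set_confusion(gold_sets, outputs):
    """Count (gold label set, predicted label set) pairs."""
    table = Counter()
    for gold, text in zip(gold_sets, outputs):
        table[(frozenset(gold), parse_label_set(text))] += 1
    return table


# Toy gold annotations and model outputs; not data from the study.
gold = [{"SI"}, {"SA"}, {"SI", "SA"}, set()]
outputs = ["SI", "SI, SA", "Labels: SI; SA", "none"]

table = label_set_confusion(gold, outputs)
for (g, p), count in table.items():
    print(sorted(g), "->", sorted(p), count)
```

In this toy data the second note is annotated SA but predicted as SI plus SA, the sort of over-labeling that an off-diagonal cell in such a table would surface.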
Ming Huang
McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, TX, USA
Zehan Li
PhD, UTHealth Houston
AI for Mental Health, Psychiatry, Biomedical Informatics, LLMs, Clinical Phenotyping
Yan Hu
McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, TX, USA
Wanjing Wang
McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, TX, USA
Andrew Wen
Data Scientist II, University of Texas Health Sciences Center at Houston | PhD Student @ Rice
Big Data, Digital Medicine, Natural Language Processing, Clinical NLP, Information Retrieval
Scott Lane
Faillace Department of Psychiatry & Behavioral Sciences, McGovern Medical School, The University of Texas Health Science Center at Houston, Houston, TX, USA
Salih Selek
Faillace Department of Psychiatry & Behavioral Sciences, McGovern Medical School, The University of Texas Health Science Center at Houston, Houston, TX, USA
Lokesh Shahani
Faillace Department of Psychiatry & Behavioral Sciences, McGovern Medical School, The University of Texas Health Science Center at Houston, Houston, TX, USA
Rodrigo Machado-Vieira
Professor of Psychiatry, Department of Psychiatry, UTHealth, Houston, TX
Bipolar disorder, depression, mood disorders, psychiatry
Jair Soares
Faillace Department of Psychiatry & Behavioral Sciences, McGovern Medical School, The University of Texas Health Science Center at Houston, Houston, TX, USA
Hua Xu
Department of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT, USA
Hongfang Liu
McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, TX, USA; Faillace Department of Psychiatry & Behavioral Sciences, McGovern Medical School, The University of Texas Health Science Center at Houston, Houston, TX, USA