Exploring The Potential of Large Language Models for Assisting with Mental Health Diagnostic Assessments -- The Depression and Anxiety Case

📅 2025-01-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the critical shortage of psychiatric specialists by investigating the clinical applicability of large language models (LLMs) for auxiliary diagnosis of major depressive disorder and generalized anxiety disorder. Methodologically, it strictly adheres to standardized PHQ-9 and GAD-7 assessment protocols and introduces a novel “diagnosis-protocol-driven” paradigm integrating prompt engineering with domain-specific fine-tuning—including instruction tuning (MentaLLaMA, Llama), multi-model, multi-strategy reasoning (GPT-3.5/4o, Llama-3.1-8B, Mixtral-8x7B), and an expert-annotated evaluation framework. The core contribution is a systematic validation of LLMs’ clinical alignment in structured psychiatric diagnosis. The best-performing model achieves 92% diagnostic agreement with board-certified psychiatrists (Cohen’s κ = 0.85), with sensitivity of 89% and specificity of 94%, significantly outperforming baseline prompting approaches and demonstrating strong potential for clinical deployment.

📝 Abstract
Large language models (LLMs) are increasingly attracting the attention of healthcare professionals for their potential to assist in diagnostic assessments, which could alleviate the strain on the healthcare system caused by a high patient load and a shortage of providers. For LLMs to be effective in supporting diagnostic assessments, it is essential that they closely replicate the standard diagnostic procedures used by clinicians. In this paper, we specifically examine the diagnostic assessment processes described in the Patient Health Questionnaire-9 (PHQ-9) for major depressive disorder (MDD) and the Generalized Anxiety Disorder-7 (GAD-7) questionnaire for generalized anxiety disorder (GAD). We investigate various prompting and fine-tuning techniques to guide both proprietary and open-source LLMs in adhering to these processes, and we evaluate the agreement between LLM-generated diagnostic outcomes and expert-validated ground truth. For fine-tuning, we utilize the MentaLLaMA and Llama models, while for prompting, we experiment with proprietary models like GPT-3.5 and GPT-4o, as well as open-source models such as Llama-3.1-8B and Mixtral-8x7B.
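The diagnostic procedure the abstract asks LLMs to replicate is anchored in fixed, published scoring rules. As a point of reference, here is a minimal sketch of the standard PHQ-9 severity bands (Kroenke et al.'s published cutoffs of 5/10/15/20); the function name and example item scores are illustrative and not taken from the paper:

```python
def phq9_severity(item_scores):
    """Map nine PHQ-9 item scores (each 0-3, "not at all" to
    "nearly every day") to a total score and severity band."""
    assert len(item_scores) == 9, "PHQ-9 has exactly nine items"
    assert all(0 <= s <= 3 for s in item_scores), "each item is scored 0-3"
    total = sum(item_scores)
    # Standard published cutoffs: 5 mild, 10 moderate,
    # 15 moderately severe, 20 severe.
    if total >= 20:
        band = "severe"
    elif total >= 15:
        band = "moderately severe"
    elif total >= 10:
        band = "moderate"
    elif total >= 5:
        band = "mild"
    else:
        band = "minimal"
    return total, band

# Hypothetical patient responses, one score per questionnaire item:
print(phq9_severity([2, 2, 1, 1, 2, 0, 1, 1, 0]))  # -> (10, 'moderate')
```

The GAD-7 follows the same pattern with seven items and cutoffs at 5/10/15. Because the mapping from item scores to a severity band is deterministic, any disagreement between an LLM's output and this rule-based result isolates where the model departs from the protocol.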
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Mental Health Diagnosis
Healthcare Resource Optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models
Psychiatric Diagnosis
Fine-tuning Techniques