GPT-4 on Clinic Depression Assessment: An LLM-Based Pilot Study

📅 2024-12-31
🤖 AI Summary
This study investigates the feasibility of using GPT-4 for early depression screening in clinical settings where access to mental health specialists is limited. Method: We formulate a binary depression classification task (depressed vs. not depressed) on clinical interview transcripts and evaluate the joint effect of prompt complexity (simple vs. complex) and temperature on accuracy and F1-score. Results: Performance varies considerably across configurations. The best results occur at low temperatures (0.0–0.2) with complex prompts; at temperature ≥ 0.3 the relationship between randomness and performance becomes unpredictable, and the gains from prompt complexity diminish. Contribution: As a preliminary study, the work indicates that GPT-4 shows promise for clinical assessment, but that prompts and model parameters require careful calibration to produce consistent results.

📝 Abstract
Depression has impacted millions of people worldwide and has become one of the most prevalent mental disorders. Early detection of mental disorders can lead to cost savings for public health agencies and help avoid the onset of other major comorbidities. Additionally, the shortage of specialized personnel is a critical issue because clinical depression diagnosis is highly dependent on expert professionals and is time-consuming. In this study, we explore the use of GPT-4 for clinical depression assessment based on transcript analysis. We examine the model's ability to classify patient interviews into binary categories: depressed and not depressed. A comparative analysis is conducted across prompt complexity (simple versus complex prompts) and varied temperature settings to assess the impact of prompt complexity and randomness on the model's performance. Results indicate that GPT-4 exhibits considerable variability in accuracy and F1-score across configurations, with optimal performance observed at lower temperature values (0.0–0.2) for complex prompts. However, beyond a certain threshold (temperature ≥ 0.3), the relationship between randomness and performance becomes unpredictable, diminishing the gains from prompt complexity. These findings suggest that, while GPT-4 shows promise for clinical assessment, the configuration of the prompts and model parameters requires careful calibration to ensure consistent results. This preliminary study contributes to understanding the dynamics between prompt engineering and large language models, offering insights for future development of AI-powered tools in clinical settings.
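The comparison described in the abstract (prompt complexity crossed with temperature, scored by accuracy and F1-score) can be sketched roughly as follows. This is an illustrative outline only, not the authors' code: the `classify` callable, the prompt texts, the temperature grid, and the toy transcripts are all assumptions made for the example, and `keyword_stub` is a deterministic placeholder standing in for a real model call.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

# Assumed label convention for this sketch: 1 = depressed, 0 = not depressed.
# A classifier takes (prompt, transcript, temperature) and returns a label.
Classifier = Callable[[str, str, float], int]


def accuracy_and_f1(y_true: List[int], y_pred: List[int]) -> Tuple[float, float]:
    """Return (accuracy, F1) for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0  # F1 = 2TP / (2TP + FP + FN)
    return accuracy, f1


def evaluate_grid(
    classify: Classifier,
    transcripts: List[str],
    labels: List[int],
    prompts: Dict[str, str],
    temperatures: List[float],
) -> Dict[Tuple[str, float], Tuple[float, float]]:
    """Score every (prompt variant, temperature) pair on the same transcripts."""
    results = {}
    for (name, prompt), temp in product(prompts.items(), temperatures):
        preds = [classify(prompt, text, temp) for text in transcripts]
        results[(name, temp)] = accuracy_and_f1(labels, preds)
    return results


def keyword_stub(prompt: str, transcript: str, temperature: float) -> int:
    """Hypothetical stand-in for a model call; ignores prompt and temperature."""
    return 1 if "hopeless" in transcript.lower() else 0


# Toy placeholder data, not taken from the study.
toy_transcripts = [
    "I feel hopeless most days.",
    "Work has been busy but fine.",
    "Everything seems hopeless lately.",
    "I slept well and enjoyed the weekend.",
]
toy_labels = [1, 0, 1, 0]

# Assumed prompt variants and temperature grid; the paper's exact values may differ.
prompts = {
    "simple": "Is the speaker depressed? Answer yes or no.",
    "complex": "Assess the interview for depressive symptoms, then answer yes or no.",
}
temperatures = [0.0, 0.1, 0.2, 0.3]

results = evaluate_grid(keyword_stub, toy_transcripts, toy_labels, prompts, temperatures)
```

Because a sampled model can return different labels for the same input at non-zero temperature, a fuller version of this sketch would repeat each configuration several times and report the spread of accuracy and F1-score, which is the kind of consistency the abstract is concerned with.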
Problem

Research questions and friction points this paper is trying to address.

GPT-4
Depression Diagnosis
AI-assisted Healthcare
Innovation

Methods, ideas, or system contributions that make the work stand out.

GPT-4
Depression Assessment
AI in Mental Health
Giuliano Lorenzoni
University of Waterloo
P. Velmovitsky
Centre for Digital Therapeutics
Paulo Alencar
Associate Director, CSG; Research Professor, University of Waterloo
software engineering, formal methods, web engineering, mobile applications, context-aware computing
Donald Cowan
University of Waterloo