AI Summary
This study addresses the opaque symptom representation mechanisms of large language models (LLMs), particularly GPT-4, in depression assessment. We present the first systematic decoding of LLM-derived depressive symptom structure using a machine behavioral analysis framework that integrates item response theory, symptom correlation modeling, expert consensus evaluation, and validation against large-scale self-report datasets. Our analysis uncovers GPT-4's clinical reasoning patterns and biases: it underemphasizes the relationship of suicidality to other symptoms while overemphasizing that of psychomotor symptoms, and its symptom inference patterns suggest hypotheses about the direction of influence between symptoms. The model demonstrates high convergent validity (self-report *r* = 0.71; expert-rated *r* = 0.81) and moderately high internal consistency. This work provides a methodological foundation and empirical evidence for interpretable, clinically aligned LLM evaluation in mental health applications.
Abstract
Use of large language models such as ChatGPT (GPT-4) for mental health support has grown rapidly, emerging as a promising route to assess and help people with mood disorders such as depression. However, we have a limited understanding of GPT-4's schema of mental disorders, that is, how it internally associates and interprets symptoms. In this work, we leveraged contemporary measurement theory to decode how GPT-4 interrelates depressive symptoms, to inform both clinical utility and theoretical understanding. We found that GPT-4's assessment of depression: (a) had high overall convergent validity (r = .71 with self-report on 955 samples, and r = .81 with expert judgments on 209 samples); (b) had moderately high internal consistency (inter-symptom correlations of r = .23 to .78) that largely aligned with the literature and self-report; except that GPT-4 (c) underemphasized the relationship of suicidality, and overemphasized that of psychomotor symptoms, with other symptoms, and (d) had symptom inference patterns that suggest nuanced hypotheses (e.g., sleep and fatigue are influenced by most other symptoms, while feelings of worthlessness/guilt are mostly influenced by depressed mood).
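The two headline statistics above, convergent validity and inter-symptom correlation, are both Pearson correlations. The sketch below illustrates how such quantities could be computed. It is not the authors' pipeline: the data are synthetic, the symptom names are a hypothetical three-item subset, and the helper names (`pearson_r`, `inter_symptom_correlations`, `simulate_items`, `totals`) are assumptions introduced for illustration.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def inter_symptom_correlations(items):
    """Map each unordered symptom pair to the correlation of its scores."""
    names = list(items)
    return {
        (a, b): pearson_r(items[a], items[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Synthetic data only (assumption): every symptom score is a shared latent
# severity plus independent noise. No values from the study are used.
rng = random.Random(0)
SYMPTOMS = ["depressed_mood", "sleep", "fatigue"]  # hypothetical subset
N_PEOPLE = 200
severity = [rng.gauss(0.0, 1.0) for _ in range(N_PEOPLE)]


def simulate_items(noise_sd):
    """Simulate per-symptom scores for one rater (model or self-report)."""
    return {
        s: [sev + rng.gauss(0.0, noise_sd) for sev in severity]
        for s in SYMPTOMS
    }


def totals(items):
    """Sum symptom scores per person to obtain a total severity score."""
    return [sum(scores) for scores in zip(*items.values())]


model_items = simulate_items(noise_sd=1.0)  # stands in for model ratings
self_items = simulate_items(noise_sd=1.0)   # stands in for self-report

# Convergent validity: correlation between the two sets of total scores.
convergent_r = pearson_r(totals(model_items), totals(self_items))

# Internal structure: pairwise correlations among the model's symptom scores.
pairwise = inter_symptom_correlations(model_items)

print(f"convergent r = {convergent_r:.2f}")
for pair, r in pairwise.items():
    print(pair, f"{r:.2f}")
```

Comparing the `pairwise` dictionary computed from model ratings with the one computed from self-report is one simple way to see where a model might under- or overweight a symptom's relationship to the others, which is the kind of contrast described in points (b) and (c).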