🤖 AI Summary
This study systematically evaluates three dimensions of large language model (LLM) reliability in mental health screening: internal consistency across repeated runs, robustness to automatic speech recognition (ASR) errors, and faithfulness of the cited evidence. Using zero-shot prompting, Phi-4, Gemma-2-9B, and Llama-3.1-8B estimated Hospital Anxiety and Depression Scale (HADS) scores for 111 English-speaking participants from ground-truth transcripts and from transcripts produced by three Whisper ASR variants (Large, Medium, Small). Phi-4 and Gemma-2-9B maintain high scoring consistency (ICC > 0.89) and keyword groundedness (> 93%) across ASR conditions, whereas Llama-3.1-8B is fragile: its ICC drops from 0.82 to 0.36 at a 10% word error rate (WER), and its keyword groundedness falls to 77–81%. Agreement between models on cited keywords is far lower than their agreement on scores, a score-evidence dissociation that bears on the interpretability of LLMs in clinical decision support.
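To make the 10% word error rate figure concrete, the following is a minimal sketch of how WER is conventionally computed: the word-level edit distance between a reference transcript and an ASR hypothesis, divided by the number of reference words. The function name `word_error_rate`, the lowercase-and-split normalization, and the example sentences are illustrative assumptions, not the study's actual pipeline.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: simple lowercase + whitespace tokenization; real pipelines
    usually also strip punctuation and normalize numbers.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substituted word in a ten-word reference.
reference = "i feel tense and worried most of the time today"
hypothesis = "i feel tents and worried most of the time today"
wer = word_error_rate(reference, hypothesis)
```

At a WER of 0.10, roughly one reference word in ten is substituted, dropped, or preceded by a spurious insertion, which can be enough to alter a symptom-bearing keyword such as "tense".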
📝 Abstract
LLMs can estimate Hospital Anxiety and Depression Scale (HADS) scores from speech in a zero-shot manner, but clinical deployment requires reliability across three dimensions: intra-model consistency, robustness to automatic speech recognition (ASR) errors, and evidence faithfulness. We evaluate three LLMs (Phi-4, Gemma-2-9B, and Llama-3.1-8B) on 111 English-speaking participants using ground-truth transcripts and three Whisper ASR variants (Large, Medium, Small), with three independent runs per model-condition pair. We find that (i) Phi-4 and Gemma-2-9B achieve excellent intra-model consistency (ICC > 0.89) with minimal degradation under ASR; (ii) Llama-3.1-8B shows ASR-fragile consistency, with ICC dropping from 0.82 to 0.36 at a 10% word error rate (WER); (iii) predictive validity is largely preserved under ASR for robust models; and (iv) keyword groundedness exceeds 93% for Phi-4 and Gemma-2-9B but falls to 77–81% for Llama-3.1-8B. Inter-model keyword agreement is far lower than score-level agreement, revealing a score-evidence dissociation with implications for clinical interpretability.
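Intra-model consistency over three independent runs can be illustrated with an intraclass correlation coefficient. The abstract does not state which ICC form was used; the sketch below assumes ICC(2,1) (two-way random effects, absolute agreement, single measurement), treating participants as rows and repeated runs as columns. The function name `icc2_1` and the input layout are hypothetical.

```python
from typing import Sequence


def icc2_1(scores: Sequence[Sequence[float]]) -> float:
    """ICC(2,1) for an n x k table: n participants (rows), k runs (columns).

    Assumption: this ICC form is chosen for illustration only; the source
    does not say which variant the authors computed.
    """
    n = len(scores)
    k = len(scores[0])
    if n < 2 or k < 2:
        raise ValueError("need at least two participants and two runs")

    grand_mean = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(scores[i][j] for i in range(n)) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_total = sum((x - grand_mean) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)                # between-participant variance
    ms_cols = ss_cols / (k - 1)                # between-run variance
    ms_error = ss_error / ((n - 1) * (k - 1))  # residual variance

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Toy data (not from the study): three runs that agree exactly per participant.
toy_scores = [[4, 4, 4], [11, 11, 11], [7, 7, 7], [15, 15, 15]]
toy_icc = icc2_1(toy_scores)
```

Under this form, systematic offsets between runs lower the coefficient as well as random disagreement, which is why it is a common choice when the absolute score, not just the ranking of participants, matters.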