🤖 AI Summary
This study addresses the tendency of existing automatic depression detection models to rely on fixed interviewer prompts in semi-structured clinical interviews rather than on patients’ authentic language, leading to inflated performance metrics and poor interpretability. Through systematic analysis of the ANDROIDS, DAIC-WOZ, and E-DAIC datasets, the authors employ speaker diarization and language modeling techniques to compare model performance with and without interviewer utterances. Their findings reveal a consistent, architecture-agnostic interviewer prompting bias across datasets. When models are constrained to use only participant speech, classification performance more accurately reflects genuine linguistic cues associated with depression, thereby avoiding spurious accuracy gains stemming from script consistency in interviewer prompts. The work underscores the necessity for future research to localize decision rationales by speaker and temporal context to enhance clinical credibility.
📝 Abstract
Automatic depression detection from doctor-patient conversations has gained momentum thanks to the availability of public corpora and advances in language modeling. However, interpretability remains limited: strong performance is often reported without revealing what drives predictions. We analyze three datasets (ANDROIDS, DAIC-WOZ, and E-DAIC) and identify a systematic bias introduced by interviewer prompts in semi-structured interviews. Models trained on interviewer turns exploit the fixed wording and position of prompts to distinguish depressed from control subjects, often achieving high classification scores without using participant language. Restricting models to participant utterances distributes decision evidence more broadly and reflects genuine linguistic cues. While semi-structured protocols ensure consistency, including interviewer prompts inflates performance by leveraging script artifacts. Our results highlight a cross-dataset, architecture-agnostic bias and emphasize the need for analyses that localize decision evidence by time and speaker, ensuring that models learn from participants' language.
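The participant-only setup described above amounts to a simple preprocessing step: drop interviewer turns before any text reaches the model. A minimal sketch of that filtering is shown below, assuming a DAIC-WOZ-style turn-level transcript with a speaker label per row; the column names and toy utterances here are illustrative assumptions, not the exact corpus schema.

```python
import pandas as pd

# Toy transcript in a turn-per-row layout with a speaker label,
# loosely modeled on DAIC-WOZ transcripts (illustrative values only).
transcript = pd.DataFrame(
    {
        "start_time": [0.0, 4.2, 9.8, 15.1],
        "speaker": ["Ellie", "Participant", "Ellie", "Participant"],
        "value": [
            "how are you doing today",
            "i guess i have been feeling pretty tired",
            "what do you do to relax",
            "mostly i just stay home and watch tv",
        ],
    }
)

def participant_only(df: pd.DataFrame) -> str:
    """Drop interviewer turns and join the remaining utterances,
    so a downstream classifier never sees the scripted prompts."""
    kept = df[df["speaker"] == "Participant"]
    return " ".join(kept["value"])

text = participant_only(transcript)
print(text)
```

Feeding only `text` to the classifier removes the fixed-prompt shortcut the paper identifies: any remaining signal must come from the participant's own language rather than from which scripted questions appear and where.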