Speech as a Multimodal Digital Phenotype for Multi-Task LLM-based Mental Health Prediction

📅 2025-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of jointly predicting adolescent depression, suicidal ideation, and sleep disorders, three frequently comorbid conditions. We propose the first speech-based tri-modal digital phenotyping framework that jointly models textual (ASR transcripts), acoustic (prosodic and spectral), and phonetic biomarker (pathology-relevant low-level cues) features, integrating large language model architectures with multi-task learning (MTL) to discriminate all three mental health conditions simultaneously. To capture disease progression, we further incorporate longitudinal sequence modeling. Our key methodological advance lies in unifying tri-modal representation learning, MTL, and longitudinal modeling within a single framework. Evaluated on the Depression Early Warning dataset, our framework achieves a balanced accuracy of 70.8%, significantly outperforming unimodal, single-task, and non-longitudinal baselines. This work establishes a novel paradigm for passive, temporally sensitive mental health risk surveillance.
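The summary does not spell out the fusion architecture, so the following is only a rough illustration: a minimal PyTorch sketch of late fusion over the three speech-derived streams described above. All names and dimensions here (TriModalFusion, d_text, d_acoustic, d_phonetic, d_shared) are hypothetical choices for the sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TriModalFusion(nn.Module):
    """Hypothetical late-fusion encoder over three speech-derived modalities.

    Inputs are assumed to be precomputed per-session vectors:
      text:     ASR-transcript embedding (e.g., from an LLM encoder)
      acoustic: prosodic/spectral summary features
      phonetic: pathology-relevant low-level biomarker features
    """
    def __init__(self, d_text=768, d_acoustic=88, d_phonetic=32, d_shared=256):
        super().__init__()
        self.proj_text = nn.Linear(d_text, d_shared)
        self.proj_acoustic = nn.Linear(d_acoustic, d_shared)
        self.proj_phonetic = nn.Linear(d_phonetic, d_shared)
        self.fuse = nn.Sequential(
            nn.Linear(3 * d_shared, d_shared), nn.ReLU(), nn.Dropout(0.1)
        )

    def forward(self, text, acoustic, phonetic):
        # Project each modality into a shared space, then fuse by concatenation.
        z = torch.cat([
            self.proj_text(text),
            self.proj_acoustic(acoustic),
            self.proj_phonetic(phonetic),
        ], dim=-1)
        return self.fuse(z)  # (batch, d_shared) fused session embedding

# Tiny smoke test with assumed feature sizes.
fusion = TriModalFusion()
z = fusion(torch.randn(2, 768), torch.randn(2, 88), torch.randn(2, 32))
print(z.shape)  # torch.Size([2, 256])
```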

📝 Abstract
Speech is a noninvasive digital phenotype that can offer valuable insights into mental health conditions, but it is often treated as a single modality. In contrast, we propose treating patient speech data as a trimodal multimedia data source for depression detection. This study explores the potential of large language model-based architectures for speech-based depression prediction in a multimodal regime that integrates speech-derived text, acoustic landmarks, and vocal biomarkers. Adolescent depression presents a significant challenge and is often comorbid with multiple disorders, such as suicidal ideation and sleep disturbances. This presents an additional opportunity to integrate multi-task learning (MTL) into our study by simultaneously predicting depression, suicidal ideation, and sleep disturbances using the multimodal formulation. We also propose a longitudinal analysis strategy that models temporal changes across multiple clinical interactions, allowing for a comprehensive understanding of the conditions' progression. Our proposed approach, featuring trimodal, longitudinal MTL, is evaluated on the Depression Early Warning dataset. It achieves a balanced accuracy of 70.8%, outperforming each of the unimodal, single-task, and non-longitudinal baselines.
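To make the MTL formulation concrete, here is a minimal sketch of a shared representation feeding one binary head per condition, trained with a weighted sum of per-task losses. The names (MultiTaskHeads, multitask_loss), the equal default weights, and the binary-cross-entropy choice are all assumptions for illustration; the paper's actual head design and loss weighting are not stated on this page.

```python
import torch
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    """Hypothetical shared-encoder MTL setup: one binary logit per condition."""
    def __init__(self, d_shared=256):
        super().__init__()
        self.heads = nn.ModuleDict({
            task: nn.Linear(d_shared, 1)
            for task in ("depression", "suicidal_ideation", "sleep_disturbance")
        })

    def forward(self, fused):
        # All tasks read the same fused representation; only the heads differ.
        return {task: head(fused).squeeze(-1) for task, head in self.heads.items()}

def multitask_loss(logits, labels, weights=None):
    """Weighted sum of per-task BCE losses; equal weights by default."""
    bce = nn.BCEWithLogitsLoss()
    weights = weights or {t: 1.0 for t in logits}
    return sum(weights[t] * bce(logits[t], labels[t].float()) for t in logits)

# Tiny smoke test with random fused embeddings and labels.
heads = MultiTaskHeads()
logits = heads(torch.randn(4, 256))
labels = {t: torch.randint(0, 2, (4,)) for t in logits}
print(multitask_loss(logits, labels))
```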
Problem

Research questions and friction points this paper is trying to address.

Multimodal speech analysis for depression detection
Multi-task learning for comorbid mental health conditions
Longitudinal modeling of mental health progression
Innovation

Methods, ideas, or system contributions that make the work stand out.

Trimodal speech data for depression detection
Multi-task learning with multimodal LLM
Longitudinal analysis of clinical interactions (see the sketch after this list)
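The longitudinal component listed above can be pictured as a recurrent model over per-visit embeddings, so a prediction at visit t reflects the trajectory of visits 1..t. The sketch below is an assumed setup (LongitudinalEncoder, the GRU choice, and all dimensions are hypothetical); the paper's actual sequence model is not specified on this page.

```python
import torch
import torch.nn as nn

class LongitudinalEncoder(nn.Module):
    """Hypothetical longitudinal module: a GRU over per-session fused embeddings,
    summarizing a patient's visit history into a single trajectory vector."""
    def __init__(self, d_shared=256, d_hidden=256):
        super().__init__()
        self.gru = nn.GRU(d_shared, d_hidden, batch_first=True)

    def forward(self, session_embeddings):
        # session_embeddings: (batch, num_visits, d_shared), time-ordered.
        outputs, last_hidden = self.gru(session_embeddings)
        return last_hidden[-1]  # (batch, d_hidden): summary of the trajectory

# Example: 4 patients, 3 clinical interactions each, 256-dim fused features.
x = torch.randn(4, 3, 256)
print(LongitudinalEncoder()(x).shape)  # torch.Size([4, 256])
```

In such a setup the trajectory vector would feed the task heads in place of a single-session embedding, which is one plausible way the longitudinal and MTL pieces could compose.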
Mai Ali
The Edward S. Rogers Sr. Department of Electrical and Computer Engineering, University of Toronto, Toronto, Canada
Christopher Lucasius
The Edward S. Rogers Sr. Department of Electrical and Computer Engineering, University of Toronto, Toronto, Canada
Tanmay P. Patel
Division of Engineering Science, University of Toronto, Toronto, Canada
Madison Aitken
Cundill Centre for Child and Youth Depression, Centre for Addiction and Mental Health, Toronto, Canada; Department of Psychology, York University, Toronto, Canada
Jacob A. S. Vorstman
The Hospital for Sick Children, Toronto, Canada
Peter Szatmari
Cundill Centre for Child and Youth Depression, Centre for Addiction and Mental Health, Toronto, Canada; Department of Psychiatry, University of Toronto, Toronto, Canada
Marco Battaglia
Department of Psychiatry, University of Toronto, Toronto, Canada
Deepa Kundur
Canada Research Chair in Cybersecurity of Intelligent Critical Infrastructure, University of Toronto
Cyber-Physical Security · Smart Grid · Smart Grid Security · Mental Health Informatics · Multimedia