Speech as a Multimodal Digital Phenotype for Multi-Task LLM-based Mental Health Prediction

📅 2025-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of jointly predicting adolescent depression, suicidal ideation, and sleep disorders, three frequently comorbid conditions. We propose the first speech-based tri-modal digital phenotyping framework that jointly models textual (ASR transcripts), acoustic (prosodic and spectral), and phonetic biomarker (pathology-relevant low-level cues) features, integrating large language model architectures with multi-task learning (MTL) to discriminate all three mental health conditions simultaneously. To capture disease progression, we further incorporate longitudinal sequence modeling. Our key methodological advance lies in unifying tri-modal representation learning, MTL, and longitudinal modeling within a single framework. Evaluated on the Depression Early Warning dataset, our framework achieves a balanced accuracy of 70.8%, significantly outperforming unimodal, single-task, and non-longitudinal baselines. This work establishes a novel paradigm for passive, temporally sensitive mental health risk surveillance.
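The summary does not spell out the fusion architecture, so the following is only a rough illustration: a minimal PyTorch sketch of late fusion over the three speech-derived streams described above. All names and dimensions here (TriModalFusion, d_text, d_acoustic, d_phonetic, d_shared) are hypothetical choices for the sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TriModalFusion(nn.Module):
    """Hypothetical late-fusion encoder over three speech-derived modalities.

    Inputs are assumed to be precomputed per-session vectors:
      text:     ASR-transcript embedding (e.g., from an LLM encoder)
      acoustic: prosodic/spectral summary features
      phonetic: pathology-relevant low-level biomarker features
    """
    def __init__(self, d_text=768, d_acoustic=88, d_phonetic=32, d_shared=256):
        super().__init__()
        self.proj_text = nn.Linear(d_text, d_shared)
        self.proj_acoustic = nn.Linear(d_acoustic, d_shared)
        self.proj_phonetic = nn.Linear(d_phonetic, d_shared)
        self.fuse = nn.Sequential(
            nn.Linear(3 * d_shared, d_shared), nn.ReLU(), nn.Dropout(0.1)
        )

    def forward(self, text, acoustic, phonetic):
        # Project each modality into a shared space, then fuse by concatenation.
        z = torch.cat([
            self.proj_text(text),
            self.proj_acoustic(acoustic),
            self.proj_phonetic(phonetic),
        ], dim=-1)
        return self.fuse(z)  # (batch, d_shared) fused session embedding

# Tiny smoke test with assumed feature sizes.
fusion = TriModalFusion()
z = fusion(torch.randn(2, 768), torch.randn(2, 88), torch.randn(2, 32))
print(z.shape)  # torch.Size([2, 256])
```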

📝 Abstract
Speech is a noninvasive digital phenotype that can offer valuable insights into mental health conditions, but it is often treated as a single modality. In contrast, we propose treating patient speech data as a trimodal multimedia data source for depression detection. This study explores the potential of large language model-based architectures for speech-based depression prediction in a multimodal regime that integrates speech-derived text, acoustic landmarks, and vocal biomarkers. Adolescent depression presents a significant challenge and is often comorbid with multiple disorders, such as suicidal ideation and sleep disturbances. This presents an additional opportunity to integrate multi-task learning (MTL) into our study by simultaneously predicting depression, suicidal ideation, and sleep disturbances using the multimodal formulation. We also propose a longitudinal analysis strategy that models temporal changes across multiple clinical interactions, allowing for a comprehensive understanding of the conditions' progression. Our proposed approach, featuring trimodal, longitudinal MTL, is evaluated on the Depression Early Warning dataset. It achieves a balanced accuracy of 70.8%, outperforming each of the unimodal, single-task, and non-longitudinal baselines.
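To make the MTL formulation concrete, here is a minimal sketch of a shared representation feeding one binary head per condition, trained with a weighted sum of per-task losses. The names (MultiTaskHeads, multitask_loss), the equal default weights, and the binary-cross-entropy choice are all assumptions for illustration; the paper's actual head design and loss weighting are not stated on this page.

```python
import torch
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    """Hypothetical shared-encoder MTL setup: one binary logit per condition."""
    def __init__(self, d_shared=256):
        super().__init__()
        self.heads = nn.ModuleDict({
            task: nn.Linear(d_shared, 1)
            for task in ("depression", "suicidal_ideation", "sleep_disturbance")
        })

    def forward(self, fused):
        # All tasks read the same fused representation; only the heads differ.
        return {task: head(fused).squeeze(-1) for task, head in self.heads.items()}

def multitask_loss(logits, labels, weights=None):
    """Weighted sum of per-task BCE losses; equal weights by default."""
    bce = nn.BCEWithLogitsLoss()
    weights = weights or {t: 1.0 for t in logits}
    return sum(weights[t] * bce(logits[t], labels[t].float()) for t in logits)

# Tiny smoke test with random fused embeddings and labels.
heads = MultiTaskHeads()
logits = heads(torch.randn(4, 256))
labels = {t: torch.randint(0, 2, (4,)) for t in logits}
print(multitask_loss(logits, labels))
```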
Problem

Research questions and friction points this paper is trying to address.

Multimodal speech analysis for depression detection
Multi-task learning for comorbid mental health conditions
Longitudinal modeling of mental health progression
Innovation

Methods, ideas, or system contributions that make the work stand out.

Trimodal speech data for depression detection
Multi-task learning with multimodal LLM
Longitudinal analysis of clinical interactions (see the sketch after this list)
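The longitudinal component listed above can be pictured as a recurrent model over per-visit embeddings, so a prediction at visit t reflects the trajectory of visits 1..t. The sketch below is an assumed setup (LongitudinalEncoder, the GRU choice, and all dimensions are hypothetical); the paper's actual sequence model is not specified on this page.

```python
import torch
import torch.nn as nn

class LongitudinalEncoder(nn.Module):
    """Hypothetical longitudinal module: a GRU over per-session fused embeddings,
    summarizing a patient's visit history into a single trajectory vector."""
    def __init__(self, d_shared=256, d_hidden=256):
        super().__init__()
        self.gru = nn.GRU(d_shared, d_hidden, batch_first=True)

    def forward(self, session_embeddings):
        # session_embeddings: (batch, num_visits, d_shared), time-ordered.
        outputs, last_hidden = self.gru(session_embeddings)
        return last_hidden[-1]  # (batch, d_hidden): summary of the trajectory

# Example: 4 patients, 3 clinical interactions each, 256-dim fused features.
x = torch.randn(4, 3, 256)
print(LongitudinalEncoder()(x).shape)  # torch.Size([4, 256])
```

In such a setup the trajectory vector would feed the task heads in place of a single-session embedding, which is one plausible way the longitudinal and MTL pieces could compose.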
Mai Ali
The Edward S. Rogers Sr. Department of Electrical and Computer Engineering, University of Toronto, Toronto, Canada
Christopher Lucasius
The Edward S. Rogers Sr. Department of Electrical and Computer Engineering, University of Toronto, Toronto, Canada
Tanmay P. Patel
Division of Engineering Science, University of Toronto, Toronto, Canada
Madison Aitken
Cundill Centre for Child and Youth Depression, Centre for Addiction and Mental Health, Toronto, Canada; Department of Psychology, York University, Toronto, Canada
Jacob A. S. Vorstman
The Hospital for Sick Children, Toronto, Canada
Peter Szatmari
Cundill Centre for Child and Youth Depression, Centre for Addiction and Mental Health, Toronto, Canada; Department of Psychiatry, University of Toronto, Toronto, Canada
Marco Battaglia
Department of Psychiatry, University of Toronto, Toronto, Canada
Deepa Kundur
Canada Research Chair in Cybersecurity of Intelligent Critical Infrastructure, University of Toronto
Cyber-Physical Security · Smart Grid · Smart Grid Security · Mental Health Informatics · Multimedia