Leveraging Audio and Text Modalities in Mental Health: A Study of LLMs Performance

📅 2024-12-09
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
This study investigates the capability of large language models (LLMs) for zero-shot multimodal (text + audio) diagnosis of depression and PTSD. Leveraging the E-DAIC dataset, we systematically evaluate single-modality and combined-modality performance of models including Gemini 1.5 Pro and GPT-4o mini. Methodologically, we propose two custom metrics, the Modal Superiority Score and the Disagreement Resolvement Score, to quantify multimodal synergy, and employ zero-shot prompting, without any fine-tuning, to enable end-to-end cross-modal reasoning. Results show that Gemini 1.5 Pro achieves an F1 score of 0.67 and a balanced accuracy of 77.4% under multimodal fusion for binary depression classification, improving on its text-only and audio-only results by 3.1% and 2.7%, respectively. This work provides empirical evidence that LLMs can perform early mental health screening via zero-shot cross-modal integration, without task-specific fine-tuning.
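
As a concrete illustration of the zero-shot cross-modal setup described above, the sketch below passes an interview transcript and its audio recording to Gemini 1.5 Pro in a single prompt. It assumes the public google-generativeai SDK; the prompt wording, file names, and one-word answer format are illustrative choices, not the authors' exact pipeline.

```python
# Minimal sketch of zero-shot multimodal prompting with Gemini 1.5 Pro.
# Assumes the google-generativeai SDK; prompt wording and file names are
# illustrative, not the paper's exact pipeline.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder credential
model = genai.GenerativeModel("gemini-1.5-pro")

# Hypothetical inputs: an interview transcript and its audio recording
transcript = open("participant_transcript.txt").read()
audio = genai.upload_file("participant_interview.wav")

prompt = (
    "You are screening for depression. Based on the interview transcript "
    "and the audio recording, answer with exactly one word: 'depressed' "
    "or 'not depressed'.\n\nTranscript:\n" + transcript
)

# Text and audio are passed together, letting the model reason across modalities
response = model.generate_content([prompt, audio])
label = response.text.strip().lower()
print(label)
```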

📝 Abstract
Mental health disorders are increasingly prevalent worldwide, creating an urgent need for innovative tools to support early diagnosis and intervention. This study explores the potential of Large Language Models (LLMs) in multimodal mental health diagnostics, specifically for detecting depression and Post-Traumatic Stress Disorder (PTSD) through text and audio modalities. Using the E-DAIC dataset, we compare the text and audio modalities to investigate whether LLMs can perform equally well or better with audio inputs. We further examine the integration of both modalities to determine whether this enhances diagnostic accuracy, which generally results in improved performance metrics. Our analysis utilizes custom-formulated metrics, the Modal Superiority Score and the Disagreement Resolvement Score, to evaluate how combining modalities influences model performance. The Gemini 1.5 Pro model achieves the highest scores in binary depression classification when using the combined modality, with an F1 score of 0.67 and a Balanced Accuracy (BA) of 77.4%, assessed across the full dataset. These results represent an increase of 3.1% over its performance with the text modality and 2.7% over the audio modality, highlighting the effectiveness of integrating modalities to enhance diagnostic accuracy. Notably, all results are obtained with zero-shot inference, highlighting the robustness of the models without task-specific fine-tuning. To explore the impact of different configurations on model performance, we conduct binary, severity, and multiclass tasks using both zero-shot and few-shot prompts, examining the effects of prompt variations on performance. The results reveal that models such as Gemini 1.5 Pro in the text and audio modalities, and GPT-4o mini in the text modality, often surpass other models in balanced accuracy and F1 scores across multiple tasks.
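
The per-modality comparison reported in the abstract (F1 score and Balanced Accuracy for text, audio, and combined inputs) can be computed with standard scikit-learn metrics. The sketch below uses hypothetical placeholder predictions rather than actual E-DAIC outputs.

```python
# Sketch of the per-modality comparison, using scikit-learn;
# the label and prediction lists are hypothetical placeholders.
from sklearn.metrics import balanced_accuracy_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]          # gold binary depression labels
preds = {
    "text":     [1, 0, 0, 1, 0, 1, 1, 0],  # model predictions per modality
    "audio":    [1, 0, 1, 0, 0, 0, 1, 1],
    "combined": [1, 0, 1, 1, 0, 0, 1, 1],
}

for modality, y_pred in preds.items():
    f1 = f1_score(y_true, y_pred)
    ba = balanced_accuracy_score(y_true, y_pred)
    print(f"{modality:>8}: F1={f1:.2f}  BA={ba:.1%}")
```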
Problem

Research questions and friction points this paper is trying to address.

Detecting depression and PTSD using text and audio modalities.
Evaluating LLM performance in multimodal mental health diagnostics.
Enhancing diagnostic accuracy by integrating text and audio inputs.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines text and audio modalities for mental health diagnosis
Uses custom metrics to evaluate multimodal gains (sketched below)
Achieves its best results with zero-shot inference
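
The paper's exact formulations of the Modal Superiority Score and Disagreement Resolvement Score are not given in this summary, so the sketch below shows one plausible way such per-sample multimodal-gain metrics could be operationalized; the formulas are assumptions for illustration only.

```python
# Plausible operationalizations of multimodal-gain metrics, in the spirit of
# the paper's Modal Superiority Score and Disagreement Resolvement Score.
# These formulas are assumptions for illustration, not the authors' definitions.

def modal_superiority(y_true, single, combined):
    """Net fraction of samples the combined modality fixes vs. breaks
    relative to a single-modality prediction."""
    fixed = sum(c == t != s for t, s, c in zip(y_true, single, combined))
    broken = sum(s == t != c for t, s, c in zip(y_true, single, combined))
    return (fixed - broken) / len(y_true)

def disagreement_resolvement(y_true, text, audio, combined):
    """Among samples where text and audio disagree, the net rate at which
    the combined prediction sides with the correct answer."""
    disagreements = [(t, c) for t, x, a, c in zip(y_true, text, audio, combined) if x != a]
    if not disagreements:
        return 0.0
    resolved = sum(c == t for t, c in disagreements)
    return (2 * resolved - len(disagreements)) / len(disagreements)
```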
Authors
Abdelrahman A. Ali
Compumacy for Artificial Intelligence solutions, Cairo, Egypt.
Aya E. Fouda
Compumacy for Artificial Intelligence solutions, Cairo, Egypt.
Radwa J. Hanafy
Compumacy for Artificial Intelligence solutions, Cairo, Egypt.
Mohammed E. Fouda
Unknown affiliation