MDD-Net: Multimodal Depression Detection through Mutual Transformer

📅 2025-08-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses automatic depression detection in social media videos by proposing a mutual Transformer-based multimodal fusion method. The approach separately extracts deep audio and facial video features, then models fine-grained acoustic-visual interactions via bidirectional cross-modal attention, enabling end-to-end joint optimization of feature fusion and classification. Experiments on the D-Vlog dataset show that the proposed model achieves an F1-score of 78.24%, outperforming the previous state of the art by up to 17.37 percentage points and validating the effectiveness of the mutual Transformer for cross-modal depressive representation learning. The core contributions are threefold: (i) the first application of the mutual Transformer to depression detection; (ii) dynamic, interpretable multimodal feature alignment; and (iii) discriminative, synergistic multimodal feature enhancement through cross-modal interaction modeling.
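The bidirectional cross-modal attention described above can be pictured as two cross-attention streams, one per modality: audio queries attend over visual keys/values and vice versa. The following is a minimal PyTorch sketch of that idea, not the paper's actual implementation (see the linked repository for that); the class and variable names (`MutualTransformerBlock`, `audio_to_visual`, the residual-plus-LayerNorm layout) are hypothetical assumptions.

```python
import torch
import torch.nn as nn

class MutualTransformerBlock(nn.Module):
    """Illustrative bidirectional cross-modal attention between audio and
    visual streams: each modality queries the other, then residual
    connections and LayerNorm keep the unimodal signal alongside the
    cross-modal one."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        # Audio queries attend over visual keys/values, and vice versa.
        self.audio_to_visual = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.visual_to_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # audio: (batch, T_audio, dim), visual: (batch, T_visual, dim)
        a_attn, _ = self.audio_to_visual(query=audio, key=visual, value=visual)
        v_attn, _ = self.visual_to_audio(query=visual, key=audio, value=audio)
        audio = self.norm_a(audio + a_attn)
        visual = self.norm_v(visual + v_attn)
        return audio, visual


# Example: fuse one batch of acoustic and facial feature sequences.
audio_feats = torch.randn(8, 30, 256)   # (batch, audio frames, dim)
visual_feats = torch.randn(8, 90, 256)  # (batch, video frames, dim)
block = MutualTransformerBlock()
a_out, v_out = block(audio_feats, visual_feats)
fused = torch.cat([a_out.mean(dim=1), v_out.mean(dim=1)], dim=-1)  # (8, 512)
```

Because queries come from one stream while keys and values come from the other, the sequence lengths of the two modalities need not match, which suits audio and video sampled at different frame rates.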

📝 Abstract
Depression is a major mental health condition that severely impacts the emotional and physical well-being of individuals. The ease of collecting data from social media platforms has attracted significant interest in properly utilizing this information for mental health research. This work proposes a Multimodal Depression Detection Network (MDD-Net) that uses acoustic and visual data obtained from social media networks, exploiting mutual transformers to extract and fuse multimodal features for efficient depression detection. The MDD-Net consists of four core modules: an acoustic feature extraction module for retrieving relevant acoustic attributes, a visual feature extraction module for extracting significant high-level patterns, a mutual transformer for computing correlations among the generated features and fusing the features from the two modalities, and a detection layer for detecting depression from the fused feature representations. Extensive experiments on the multimodal D-Vlog dataset reveal that the proposed network surpasses the state of the art by up to 17.37% in F1-score, demonstrating the superior performance of the proposed system. The source code is accessible at https://github.com/rezwanh001/Multimodal-Depression-Detection.
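To make the four-module layout in the abstract concrete, here is a minimal sketch of how the modules could compose end to end, reusing the `MutualTransformerBlock` sketched above. The GRU encoders, feature dimensions, and mean-pooling fusion are placeholder assumptions for illustration, not the paper's actual extractors or detection head.

```python
import torch
import torch.nn as nn

class MDDNetSketch(nn.Module):
    """Hypothetical four-module pipeline mirroring the abstract:
    (1) acoustic extractor, (2) visual extractor,
    (3) mutual transformer fusion, (4) detection layer."""

    def __init__(self, audio_dim=25, visual_dim=136, dim=256, num_classes=2):
        super().__init__()
        # (1) and (2): per-modality sequence encoders (placeholders).
        self.audio_enc = nn.GRU(audio_dim, dim, batch_first=True)
        self.visual_enc = nn.GRU(visual_dim, dim, batch_first=True)
        # (3): cross-modal fusion (MutualTransformerBlock defined above).
        self.mutual = MutualTransformerBlock(dim)
        # (4): classifier over the concatenated fused representation.
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, audio, visual):
        a, _ = self.audio_enc(audio)    # (batch, T_audio, dim)
        v, _ = self.visual_enc(visual)  # (batch, T_visual, dim)
        a, v = self.mutual(a, v)
        # Temporal mean pooling, then concatenation across modalities.
        fused = torch.cat([a.mean(dim=1), v.mean(dim=1)], dim=-1)
        return self.classifier(fused)   # depressed vs. non-depressed logits
```

The joint, end-to-end optimization the summary mentions falls out of this design: a single classification loss backpropagates through the detection layer, the fusion block, and both extractors at once.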
Problem

Research questions and friction points this paper is trying to address.

Detecting depression using multimodal social media data
Improving feature extraction with mutual transformers
Enhancing accuracy over existing depression detection methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses acoustic and visual data from social media
Employs mutual transformers for feature fusion
Outperforms the state of the art by up to 17.37% in F1-score
Md Rezwanul Haque
Centre for Pattern Analysis and Machine Intelligence, Department of Electrical and Computer Engineering, University of Waterloo, N2L 3G1, Ontario, Canada
Md. Milon Islam
University of Waterloo
Multimodal Machine Learning · AI for Health · Large Language Models
S M Taslim Uddin Raju
MASc in Computer Science (Specialized in AI)
Machine Learning · Medical Imaging · Deep Learning · Biomedical Engineering
Hamdi Altaheri
PhD, Postdoctoral Scholar at University of Waterloo
Deep Learning · Foundation Models · Self-Supervised Learning
Lobna Nassar
Ph.D. candidate and research assistant at University of Waterloo, ON, Canada
Information Retrieval · Crowdsourcing · VANET
Fakhri Karray
Centre for Pattern Analysis and Machine Intelligence, Department of Electrical and Computer Engineering, University of Waterloo, N2L 3G1, Ontario, Canada, and Department of Machine Learning, Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates