MDD-Net: Multimodal Depression Detection through Mutual Transformer

📅 2025-08-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses automatic depression detection in social media videos by proposing a mutual Transformer-based multimodal fusion method. The approach separately extracts deep audio and facial video features, then models fine-grained acoustic-visual interactions via bidirectional cross-modal attention, enabling end-to-end joint optimization of feature fusion and classification. Experiments on the D-Vlog dataset show that the proposed model achieves an F1-score of 78.24%, outperforming the previous state of the art by up to 17.37 percentage points and validating the effectiveness of the mutual Transformer for cross-modal depressive representation learning. The core contributions are threefold: (i) the first application of the mutual Transformer to depression detection; (ii) dynamic, interpretable multimodal feature alignment; and (iii) discriminative, synergistic multimodal feature enhancement through cross-modal interaction modeling.
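The bidirectional cross-modal attention described above can be pictured as two cross-attention streams, one per modality: audio queries attend over visual keys/values and vice versa. The following is a minimal PyTorch sketch of that idea, not the paper's actual implementation (see the linked repository for that); the class and variable names (`MutualTransformerBlock`, `audio_to_visual`, the residual-plus-LayerNorm layout) are hypothetical assumptions.

```python
import torch
import torch.nn as nn

class MutualTransformerBlock(nn.Module):
    """Illustrative bidirectional cross-modal attention between audio and
    visual streams: each modality queries the other, then residual
    connections and LayerNorm keep the unimodal signal alongside the
    cross-modal one."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        # Audio queries attend over visual keys/values, and vice versa.
        self.audio_to_visual = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.visual_to_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # audio: (batch, T_audio, dim), visual: (batch, T_visual, dim)
        a_attn, _ = self.audio_to_visual(query=audio, key=visual, value=visual)
        v_attn, _ = self.visual_to_audio(query=visual, key=audio, value=audio)
        audio = self.norm_a(audio + a_attn)
        visual = self.norm_v(visual + v_attn)
        return audio, visual


# Example: fuse one batch of acoustic and facial feature sequences.
audio_feats = torch.randn(8, 30, 256)   # (batch, audio frames, dim)
visual_feats = torch.randn(8, 90, 256)  # (batch, video frames, dim)
block = MutualTransformerBlock()
a_out, v_out = block(audio_feats, visual_feats)
fused = torch.cat([a_out.mean(dim=1), v_out.mean(dim=1)], dim=-1)  # (8, 512)
```

Because queries come from one stream while keys and values come from the other, the sequence lengths of the two modalities need not match, which suits audio and video sampled at different frame rates.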

📝 Abstract
Depression is a major mental health condition that severely impacts the emotional and physical well-being of individuals. The ease of collecting data from social media platforms has attracted significant interest in properly utilizing this information for mental health research. This work proposes a Multimodal Depression Detection Network (MDD-Net) that uses acoustic and visual data obtained from social media networks, exploiting mutual transformers to extract and fuse multimodal features for efficient depression detection. The MDD-Net consists of four core modules: an acoustic feature extraction module for retrieving relevant acoustic attributes, a visual feature extraction module for extracting significant high-level patterns, a mutual transformer for computing correlations among the generated features and fusing the features from the two modalities, and a detection layer for detecting depression from the fused feature representations. Extensive experiments on the multimodal D-Vlog dataset reveal that the proposed network surpasses the state of the art by up to 17.37% in F1-score, demonstrating the superior performance of the proposed system. The source code is accessible at https://github.com/rezwanh001/Multimodal-Depression-Detection.
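To make the four-module layout in the abstract concrete, here is a minimal sketch of how the modules could compose end to end, reusing the `MutualTransformerBlock` sketched above. The GRU encoders, feature dimensions, and mean-pooling fusion are placeholder assumptions for illustration, not the paper's actual extractors or detection head.

```python
import torch
import torch.nn as nn

class MDDNetSketch(nn.Module):
    """Hypothetical four-module pipeline mirroring the abstract:
    (1) acoustic extractor, (2) visual extractor,
    (3) mutual transformer fusion, (4) detection layer."""

    def __init__(self, audio_dim=25, visual_dim=136, dim=256, num_classes=2):
        super().__init__()
        # (1) and (2): per-modality sequence encoders (placeholders).
        self.audio_enc = nn.GRU(audio_dim, dim, batch_first=True)
        self.visual_enc = nn.GRU(visual_dim, dim, batch_first=True)
        # (3): cross-modal fusion (MutualTransformerBlock defined above).
        self.mutual = MutualTransformerBlock(dim)
        # (4): classifier over the concatenated fused representation.
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, audio, visual):
        a, _ = self.audio_enc(audio)    # (batch, T_audio, dim)
        v, _ = self.visual_enc(visual)  # (batch, T_visual, dim)
        a, v = self.mutual(a, v)
        # Temporal mean pooling, then concatenation across modalities.
        fused = torch.cat([a.mean(dim=1), v.mean(dim=1)], dim=-1)
        return self.classifier(fused)   # depressed vs. non-depressed logits
```

The joint, end-to-end optimization the summary mentions falls out of this design: a single classification loss backpropagates through the detection layer, the fusion block, and both extractors at once.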
Problem

Research questions and friction points this paper is trying to address.

Detecting depression using multimodal social media data
Improving feature extraction with mutual transformers
Enhancing accuracy over existing depression detection methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses acoustic and visual data from social media
Employs mutual transformers for feature fusion
Outperforms the state of the art by up to 17.37% in F1-score
Md Rezwanul Haque
Centre for Pattern Analysis and Machine Intelligence, Department of Electrical and Computer Engineering, University of Waterloo, N2L 3G1, Ontario, Canada
Md. Milon Islam
University of Waterloo
Multimodal Machine Learning · AI for Health · Large Language Models
S M Taslim Uddin Raju
MASc in Computer Science (Specialized in AI)
Machine Learning · Medical Imaging · Deep Learning · Biomedical Engineering
Hamdi Altaheri
PhD, Postdoctoral Scholar at University of Waterloo
Deep Learning · Foundation Models · Self-Supervised Learning
Lobna Nassar
Ph.D. candidate and research assistant at University of Waterloo, ON, Canada
Information Retrieval · Crowdsourcing · VANET
Fakhri Karray
Centre for Pattern Analysis and Machine Intelligence, Department of Electrical and Computer Engineering, University of Waterloo, N2L 3G1, Ontario, Canada, and Department of Machine Learning, Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates