MMFformer: Multimodal Fusion Transformer Network for Depression Detection

📅 2025-08-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Early depression detection relies heavily on subjective clinical assessments, while automated approaches face challenges in modeling multimodal temporal dynamics and uncovering cross-modal correlations. To address these issues, we propose MMFformer—a novel multimodal Transformer architecture incorporating residual connections. It jointly extracts spatial features from video and models temporal dynamics from audio, and introduces a dual-stage fusion strategy—comprising intermediate-layer and late-stage fusion—to explicitly capture high-order spatiotemporal dependencies across modalities. Evaluated on the D-Vlog and LMVD benchmarks, MMFformer achieves absolute F1-score improvements of 13.92% and 7.74%, respectively, outperforming state-of-the-art methods by a significant margin. The implementation is publicly available.

📝 Abstract
Depression is a serious mental illness that significantly affects an individual's well-being and quality of life, making early detection crucial for adequate care and treatment. Detecting depression is often difficult, as it is based primarily on subjective evaluations during clinical interviews. Hence, early diagnosis of depression from social network content has become a prominent research area. The extensive and diverse nature of user-generated information poses a significant challenge, limiting the accurate extraction of relevant temporal information and the effective fusion of data across multiple modalities. This paper introduces MMFformer, a multimodal depression detection network designed to retrieve depressive spatio-temporal high-level patterns from multimodal social media information. A transformer network with residual connections captures spatial features from video, and a transformer encoder is exploited to model important temporal dynamics in audio. Moreover, the fusion architecture fuses the extracted features through intermediate and late fusion strategies to identify the most relevant intermodal correlations among them. Finally, the proposed network is assessed on two large-scale depression detection datasets, and the results clearly reveal that it surpasses existing state-of-the-art approaches, improving the F1-Score by 13.92% on the D-Vlog dataset and 7.74% on the LMVD dataset. The code is made available publicly at https://github.com/rezwanh001/Large-Scale-Multimodal-Depression-Detection.
Problem

Research questions and friction points this paper is trying to address.

Detecting depression via multimodal social media data
Fusing temporal and spatial features from diverse modalities
Improving accuracy in depression detection over existing methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer network captures spatial video features
Transformer encoder models temporal audio dynamics
Intermediate- and late-stage fusion uncovers intermodal correlations
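As a minimal illustrative sketch of the dual-stage idea described above (not the authors' implementation — the function names, dimensions, random weights, and pooling choices here are all hypothetical), the two unimodal transformer streams and the intermediate-plus-late fusion can be expressed in a few lines of numpy:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_block(x, rng):
    """Single-head scaled dot-product self-attention with a residual
    connection (a stand-in for one transformer encoder sub-block)."""
    d = x.shape[-1]
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(d))
    return x + attn @ v  # residual connection, as in the summary above

def dual_stage_fusion(video_feats, audio_feats, rng):
    # Unimodal streams: spatial (video) and temporal (audio) encoders.
    v_mid = self_attention_block(video_feats, rng)
    a_mid = self_attention_block(audio_feats, rng)

    # Intermediate fusion: concatenate pooled mid-level features.
    inter = np.concatenate([v_mid.mean(axis=0), a_mid.mean(axis=0)])

    # Hypothetical linear classifier heads (weights are illustrative).
    d = inter.shape[0]
    logit_inter = inter @ (rng.standard_normal(d) / np.sqrt(d))
    dv, da = v_mid.shape[1], a_mid.shape[1]
    logit_v = v_mid.mean(axis=0) @ (rng.standard_normal(dv) / np.sqrt(dv))
    logit_a = a_mid.mean(axis=0) @ (rng.standard_normal(da) / np.sqrt(da))

    # Late fusion: average the fused and unimodal decision scores.
    return (logit_inter + logit_v + logit_a) / 3.0

rng = np.random.default_rng(0)
video = rng.standard_normal((16, 32))  # 16 frames x 32-d spatial features
audio = rng.standard_normal((50, 32))  # 50 steps x 32-d acoustic features
score = dual_stage_fusion(video, audio, rng)  # scalar depression score
```

Note that the two fusion stages play different roles: intermediate fusion lets the classifier see cross-modal feature interactions, while late fusion combines per-modality decisions, which is why the paper reports using both.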
Md Rezwanul Haque
Centre for Pattern Analysis and Machine Intelligence, Department of Electrical and Computer Engineering, University of Waterloo, N2L 3G1, Ontario, Canada
Md. Milon Islam
University of Waterloo
Multimodal Machine Learning, AI for Health, Large Language Models
S M Taslim Uddin Raju
MASc in Computer Science (Specialized in AI)
Machine Learning, Medical Imaging, Deep Learning, Biomedical Engineering
Hamdi Altaheri
PhD, Postdoctoral Scholar at the University of Waterloo
Deep Learning, Foundation Models, Self-Supervised Learning
Lobna Nassar
Ph.D. Candidate and Research Assistant at the University of Waterloo, ON, Canada
Information Retrieval, Crowdsourcing, VANET
Fakhri Karray
Centre for Pattern Analysis and Machine Intelligence, Department of Electrical and Computer Engineering, University of Waterloo, N2L 3G1, Ontario, Canada, and Department of Machine Learning, Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates