🤖 AI Summary
This study addresses a limitation of existing speech-based depression detection methods, which typically assume that depression-related features are uniformly distributed across an utterance and thereby overlook their inherent sparsity. To overcome this, the authors propose a bimodal network built on an Adaptive Cross-Modal Gating (ACMG) mechanism that dynamically reweights frames in both the acoustic and textual modalities so the model attends selectively to depression-relevant segments. ACMG is the work's main contribution: it performs frame-level feature reweighting to capture clinically meaningful but sparsely distributed patterns, such as low-energy speech segments and negatively valenced lexical content. Experimental results show that the model with ACMG outperforms baselines without it, and visualization analyses indicate that ACMG automatically focuses on these depression indicators.
📝 Abstract
Automatic depression detection using speech signals with acoustic and textual modalities is a promising approach for early diagnosis. Depression-related patterns exhibit sparsity in speech: diagnostically relevant features occur in specific segments rather than being uniformly distributed. However, most existing methods treat all frames equally and thus overlook this sparsity. To address this issue, we propose a depression detection network based on Adaptive Cross-Modal Gating (ACMG) that adaptively reassigns frame-level weights across both modalities, enabling selective attention to depression-related segments. Experimental results show that the depression detection system with ACMG outperforms baselines without it. Visualization analyses further confirm that ACMG automatically attends to clinically meaningful patterns, including low-energy acoustic segments and textual segments expressing negative sentiment.
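
To make the idea of frame-level reweighting concrete, the following is a minimal NumPy sketch of one way a cross-modal gate could work. It is not the ACMG architecture itself, which the abstract does not specify. The function `cross_modal_gate`, the mean-pooled summary of the other modality, the projection vectors, and the feature sizes are all assumptions made for illustration, and the weights are random rather than learned.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_modal_gate(frames, context, w_self, w_ctx, bias=0.0):
    """Scale each frame by a scalar gate in (0, 1).

    frames:  (T, d)   frame-level features of the modality being gated
    context: (S, d_c) frame-level features of the other modality
    w_self:  (d,)     hypothetical projection for the gated modality
    w_ctx:   (d_c,)   hypothetical projection for the other modality
    """
    # Assumption: the other modality is summarized by mean pooling.
    summary = context.mean(axis=0)                      # (d_c,)
    # One logit per frame, shifted by the cross-modal summary.
    logits = frames @ w_self + summary @ w_ctx + bias   # (T,)
    gates = sigmoid(logits)                             # (T,)
    # Frames with gates near 0 are suppressed; near 1 are kept.
    return frames * gates[:, None], gates

# Toy inputs: 6 acoustic frames (dim 4) and 3 text tokens (dim 5).
rng = np.random.default_rng(0)
acoustic = rng.normal(size=(6, 4))
text = rng.normal(size=(3, 5))

# Random stand-ins for parameters that would be learned in practice.
w_a, w_t_ctx = rng.normal(size=4), rng.normal(size=5)
w_t, w_a_ctx = rng.normal(size=5), rng.normal(size=4)

# Gate each modality using the other one as context.
gated_acoustic, acoustic_gates = cross_modal_gate(acoustic, text, w_a, w_t_ctx)
gated_text, text_gates = cross_modal_gate(text, acoustic, w_t, w_a_ctx)
```

In a trained model, inspecting `acoustic_gates` and `text_gates` per frame is the kind of analysis that would reveal which segments the gate emphasizes, analogous to the visualizations the abstract describes.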