AI Summary
Existing depression detection models are prone to relying on spurious correlations between linguistic sentiment and diagnostic labels, leading to poor robustness in real-world scenarios such as feigned depression. To address this limitation, this work proposes DepFlow, a three-stage conditional speech synthesis framework that, for the first time, disentangles depressive acoustic features from speaker identity and textual content, while enabling an interpretable, continuous control mechanism for depression severity. Leveraging an adversarially trained depressive acoustic encoder, a FiLM-modulated flow-matching TTS model, and a prototype-based severity mapping, DepFlow generates CDoA, an enhanced dataset featuring semantic-acoustic mismatches. Experiments show that training on CDoA improves macro-F1 scores by 9%, 12%, and 5% respectively across three mainstream depression detection models, substantially outperforming conventional data augmentation approaches.
Abstract
Speech is a scalable and non-invasive biomarker for early mental health screening. However, widely used depression datasets like DAIC-WOZ exhibit strong coupling between linguistic sentiment and diagnostic labels, encouraging models to learn semantic shortcuts. As a result, model robustness may be compromised in real-world scenarios such as Camouflaged Depression, where individuals maintain socially positive or neutral language despite underlying depressive states. To mitigate this semantic bias, we propose DepFlow, a three-stage depression-conditioned text-to-speech framework. First, a Depression Acoustic Encoder learns speaker- and content-invariant depression embeddings through adversarial training, achieving effective disentanglement while preserving depression discriminability (ROC-AUC: 0.693). Second, a flow-matching TTS model with FiLM modulation injects these embeddings into synthesis, enabling control over depressive severity while preserving content and speaker identity. Third, a prototype-based severity mapping mechanism provides smooth and interpretable manipulation across the depression continuum. Using DepFlow, we construct a Camouflage Depression-oriented Augmentation (CDoA) dataset that pairs depressed acoustic patterns with positive/neutral content from a sentiment-stratified text bank, creating acoustic-semantic mismatches underrepresented in natural data. Evaluated across three depression detection architectures, CDoA improves macro-F1 by 9%, 12%, and 5%, respectively, consistently outperforming conventional augmentation strategies in depression detection. Beyond enhancing robustness, DepFlow provides a controllable synthesis platform for conversational systems and simulation-based evaluation, where real clinical data remains limited by ethical and coverage constraints.
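As a rough illustration of the second and third stages, the sketch below shows how a severity value might be turned into a conditioning embedding and then injected into hidden features with FiLM (feature-wise scaling and shifting). It is not DepFlow's implementation: the abstract does not specify how the prototype-based mapping works, so linear interpolation between two prototype embeddings is an assumption made here, and every function name, shape, and weight value (`severity_to_embedding`, `linear`, `film`, the toy matrices) is hypothetical.

```python
def severity_to_embedding(severity, proto_low, proto_high):
    """Map a severity in [0, 1] to an embedding.

    ASSUMPTION: linear interpolation between a low-severity and a
    high-severity prototype; the paper's actual mapping may differ.
    """
    if not 0.0 <= severity <= 1.0:
        raise ValueError("severity must lie in [0, 1]")
    return [(1.0 - severity) * lo + severity * hi
            for lo, hi in zip(proto_low, proto_high)]


def linear(x, weight, bias):
    """Dense layer; `weight` holds one row of input weights per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def film(hidden, cond, w_gamma, b_gamma, w_beta, b_beta):
    """FiLM: per-channel scale (gamma) and shift (beta) predicted from `cond`.

    `hidden` is a list of frames, each a list of channel values; the same
    gamma and beta are applied to every frame.
    """
    gamma = linear(cond, w_gamma, b_gamma)
    beta = linear(cond, w_beta, b_beta)
    return [[g * h + b for g, h, b in zip(gamma, frame, beta)]
            for frame in hidden]


# Toy usage with made-up numbers: 2-dim embedding, 3 hidden channels, 2 frames.
cond = severity_to_embedding(0.5, proto_low=[0.0, 0.0], proto_high=[1.0, 2.0])
hidden = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
w_gamma = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.1]]  # hypothetical learned weights
w_beta = [[0.0, 0.2], [0.2, 0.0], [0.0, 0.0]]   # hypothetical learned weights
modulated = film(hidden, cond, w_gamma, [1.0, 1.0, 1.0], w_beta, [0.0, 0.0, 0.0])
print(modulated)
```

Because the embedding enters only through the predicted scale and shift, changing the severity value alters how the features are modulated while the text and speaker inputs to the synthesizer stay the same, which is the kind of control the abstract describes. In a trained system the weights would be learned jointly with the TTS model rather than set by hand.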