DepFlow: Disentangled Speech Generation to Mitigate Semantic Bias in Depression Detection

📅 2026-01-01
🏛️ arXiv.org

🤖 AI Summary
Existing depression detection models tend to rely on spurious correlations between linguistic sentiment and diagnostic labels, which undermines robustness in real-world scenarios such as camouflaged depression, where a person's wording stays positive or neutral despite an underlying depressive state. To address this limitation, this work proposes DepFlow, a three-stage conditional speech synthesis framework that disentangles depressive acoustic features from speaker identity and textual content and provides interpretable, continuous control over depression severity. DepFlow combines an adversarially trained depression acoustic encoder, a FiLM-modulated flow-matching TTS model, and a prototype-based severity mapping. It is used to generate CDoA, an augmentation dataset that pairs depressed acoustic patterns with positive or neutral text to create acoustic-semantic mismatches. Experiments show that training on CDoA improves macro-F1 by 9%, 12%, and 5%, respectively, across three depression detection models, outperforming conventional data augmentation approaches.
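
For readers unfamiliar with the reported metric, the following is a minimal sketch of how macro-F1 is computed: the F1 score is calculated per class and then averaged without weighting by class frequency, so the minority (depressed) class counts as much as the majority class. The labels below are toy values invented for illustration and are not taken from the paper.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score for one class, treating `cls` as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    # Convention: F1 is 0 when the class is never present nor predicted.
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Toy example (hypothetical labels): 1 = depressed, 0 = not depressed.
y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1]
score = macro_f1(y_true, y_pred)
print(score)
```

Because each class contributes equally, a detector that ignores the depressed class is penalized heavily even when its overall accuracy looks acceptable.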

📝 Abstract
Speech is a scalable and non-invasive biomarker for early mental health screening. However, widely used depression datasets like DAIC-WOZ exhibit strong coupling between linguistic sentiment and diagnostic labels, encouraging models to learn semantic shortcuts. As a result, model robustness may be compromised in real-world scenarios, such as Camouflaged Depression, where individuals maintain socially positive or neutral language despite underlying depressive states. To mitigate this semantic bias, we propose DepFlow, a three-stage depression-conditioned text-to-speech framework. First, a Depression Acoustic Encoder learns speaker- and content-invariant depression embeddings through adversarial training, achieving effective disentanglement while preserving depression discriminability (ROC-AUC: 0.693). Second, a flow-matching TTS model with FiLM modulation injects these embeddings into synthesis, enabling control over depressive severity while preserving content and speaker identity. Third, a prototype-based severity mapping mechanism provides smooth and interpretable manipulation across the depression continuum. Using DepFlow, we construct a Camouflage Depression-oriented Augmentation (CDoA) dataset that pairs depressed acoustic patterns with positive/neutral content from a sentiment-stratified text bank, creating acoustic-semantic mismatches underrepresented in natural data. Evaluated across three depression detection architectures, CDoA improves macro-F1 by 9%, 12%, and 5%, respectively, consistently outperforming conventional augmentation strategies in depression detection. Beyond enhancing robustness, DepFlow provides a controllable synthesis platform for conversational systems and simulation-based evaluation, where real clinical data remains limited by ethical and coverage constraints.
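
The second and third stages can be illustrated with a small sketch. The abstract does not give the architecture details, so everything below is an assumption: the severity mapping is modeled as linear interpolation between ordered prototype embeddings, and FiLM is shown in its generic form, where a conditioning vector produces a per-channel scale `gamma` and shift `beta` applied to hidden features. The function names, dimensions, and weights are hypothetical and are not the authors' implementation.

```python
def interpolate_prototypes(prototypes, severity):
    """Map a severity in [0, 1] to an embedding by linear interpolation
    between ordered prototype vectors (assumed mapping, for illustration)."""
    if len(prototypes) < 2:
        raise ValueError("need at least two prototypes")
    if not 0.0 <= severity <= 1.0:
        raise ValueError("severity must lie in [0, 1]")
    pos = severity * (len(prototypes) - 1)
    lo = min(int(pos), len(prototypes) - 2)  # index of the lower prototype
    w = pos - lo                             # weight of the upper prototype
    return [(1 - w) * a + w * b
            for a, b in zip(prototypes[lo], prototypes[lo + 1])]


def linear(weights, bias, x):
    """Plain affine map: weights @ x + bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def film(features, cond, w_gamma, b_gamma, w_beta, b_beta):
    """Generic FiLM: scale and shift each feature channel using
    parameters predicted from the conditioning vector."""
    gamma = linear(w_gamma, b_gamma, cond)
    beta = linear(w_beta, b_beta, cond)
    return [g * h + b for g, h, b in zip(gamma, features, beta)]


# Hypothetical 2-D prototypes ordered from low to high severity.
prototypes = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]
cond = interpolate_prototypes(prototypes, 0.25)

# Toy FiLM parameters: identity map for gamma, constant shift for beta.
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
hidden = [10.0, 10.0]
modulated = film(hidden, cond, identity, [0.0, 0.0], zeros, [1.0, 1.0])
print(cond, modulated)
```

In this sketch, moving `severity` smoothly from 0 to 1 moves the conditioning vector along the prototype path, which is one simple way to obtain the continuous control the abstract describes. The hidden features, which stand in for content and speaker information, are only rescaled and shifted.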
Problem

Research questions and friction points this paper is trying to address.

semantic bias
depression detection
camouflaged depression
speech biomarker
model robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

disentangled speech generation
semantic bias mitigation
depression-conditioned TTS
flow-matching
acoustic-semantic mismatch