DepFlow: Disentangled Speech Generation to Mitigate Semantic Bias in Depression Detection

📅 2026-01-01

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing depression detection models tend to rely on spurious correlations between linguistic sentiment and diagnostic labels, which undermines robustness in real-world scenarios such as camouflaged depression, where depressed individuals use positive or neutral language. To address this limitation, this work proposes DepFlow, a three-stage conditional speech synthesis framework that disentangles depressive acoustic features from speaker identity and textual content and provides interpretable, continuous control over depression severity. Using an adversarially trained depression acoustic encoder, a FiLM-modulated flow-matching TTS model, and a prototype-based severity mapping, DepFlow generates CDoA, an augmentation dataset built around semantic-acoustic mismatches. Experiments show that training on CDoA improves macro-F1 by 9%, 12%, and 5%, respectively, across three depression detection models, consistently outperforming conventional data augmentation approaches.

📝 Abstract

Speech is a scalable and non-invasive biomarker for early mental health screening. However, widely used depression datasets like DAIC-WOZ exhibit strong coupling between linguistic sentiment and diagnostic labels, encouraging models to learn semantic shortcuts. As a result, model robustness may be compromised in real-world scenarios, such as Camouflaged Depression, where individuals maintain socially positive or neutral language despite underlying depressive states. To mitigate this semantic bias, we propose DepFlow, a three-stage depression-conditioned text-to-speech framework. First, a Depression Acoustic Encoder learns speaker- and content-invariant depression embeddings through adversarial training, achieving effective disentanglement while preserving depression discriminability (ROC-AUC: 0.693). Second, a flow-matching TTS model with FiLM modulation injects these embeddings into synthesis, enabling control over depressive severity while preserving content and speaker identity. Third, a prototype-based severity mapping mechanism provides smooth and interpretable manipulation across the depression continuum. Using DepFlow, we construct a Camouflage Depression-oriented Augmentation (CDoA) dataset that pairs depressed acoustic patterns with positive/neutral content from a sentiment-stratified text bank, creating acoustic-semantic mismatches underrepresented in natural data. Evaluated across three depression detection architectures, CDoA improves macro-F1 by 9%, 12%, and 5%, respectively, consistently outperforming conventional augmentation strategies in depression detection. Beyond enhancing robustness, DepFlow provides a controllable synthesis platform for conversational systems and simulation-based evaluation, where real clinical data remains limited by ethical and coverage constraints.
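
The abstract does not give implementation details, so the following is only a minimal sketch of how the second and third stages could fit together, under stated assumptions. It assumes the severity mapping is a linear blend between a "healthy" and a "depressed" prototype embedding, and that FiLM is applied in its standard form (a per-channel scale and shift predicted from the conditioning embedding). The function names, the two-prototype setup, and the toy weights are hypothetical and are not taken from the paper.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def interpolate_prototypes(healthy: Vector, depressed: Vector, severity: float) -> Vector:
    """Blend two prototype embeddings (assumed form of the severity mapping).

    severity is clipped to [0, 1]: 0 returns the healthy prototype,
    1 returns the depressed prototype.
    """
    s = min(1.0, max(0.0, severity))
    return [(1.0 - s) * h + s * d for h, d in zip(healthy, depressed)]


def linear(weights: Matrix, bias: Vector, x: Vector) -> Vector:
    """Plain affine map: weights @ x + bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def film(hidden: Vector, cond: Vector,
         w_gamma: Matrix, b_gamma: Vector,
         w_beta: Matrix, b_beta: Vector) -> Vector:
    """Standard FiLM: scale and shift each hidden channel using the condition."""
    gamma = linear(w_gamma, b_gamma, cond)  # per-channel scale
    beta = linear(w_beta, b_beta, cond)     # per-channel shift
    return [g * h + b for g, h, b in zip(gamma, hidden, beta)]


# Toy usage: a mid-severity embedding modulates a 2-channel hidden state.
cond = interpolate_prototypes([0.0, 0.0], [1.0, 2.0], severity=0.5)
modulated = film(
    hidden=[1.0, 2.0],
    cond=cond,
    w_gamma=[[1.0, 0.0], [0.0, 1.0]], b_gamma=[1.0, 1.0],
    w_beta=[[0.0, 0.0], [0.0, 0.0]], b_beta=[0.5, -1.0],
)
```

In a real system the hidden state would be an intermediate feature map of the flow-matching TTS network and the FiLM weights would be learned; the sketch only illustrates how a single scalar severity can steer synthesis through a conditioning embedding.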

Problem

Research questions and friction points this paper is trying to address.

semantic bias

depression detection

camouflaged depression

speech biomarker

model robustness

Innovation

Methods, ideas, or system contributions that make the work stand out.

disentangled speech generation

semantic bias mitigation

depression-conditioned TTS