The Affective Bridge: Unifying Feature Representations for Speech Deepfake Detection

📅 2025-12-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Deepfake speech detection faces dual challenges: fragmented feature representations and limited human interpretability. Existing low-level acoustic features inadequately capture semantic and perceptual distinctions, while generative models still struggle to fully replicate human affective patterns. To address this, we propose a novel emotion-guided training framework that unifies supervision via fine-grained emotion modeling, which serves as a cross-modal alignment bridge between acoustic and linguistic representations. Our end-to-end detector jointly integrates ASR-derived features, emotion embeddings, multi-task learning, and contrastive representation alignment. Evaluated on the FakeOrReal and In-the-Wild datasets, the method achieves absolute accuracy gains of up to roughly 6% and 2%, respectively, and reduces equal error rate (EER) by up to about 4% and 1%. It achieves comparable results on ASVspoof2019 while improving robustness and interpretability.
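The summary above describes an end-to-end detector that fuses ASR-derived features with emotion embeddings and trains them jointly under a spoof-classification objective, an auxiliary emotion-recognition objective, and a contrastive alignment objective. The PyTorch sketch below is a hypothetical illustration of that structure, not the authors' implementation: all module names, feature dimensions, and the InfoNCE-style alignment loss are assumptions.

```python
# Hypothetical sketch of the emotion-bridged multi-task detector described above.
# Names, dimensions, and losses are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmotionBridgedDetector(nn.Module):
    def __init__(self, asr_dim=768, emo_dim=256, shared_dim=256, n_emotions=8):
        super().__init__()
        # Project precomputed ASR features and emotion embeddings into a shared space.
        self.asr_proj = nn.Linear(asr_dim, shared_dim)
        self.emo_proj = nn.Linear(emo_dim, shared_dim)
        # The fused representation feeds two task heads (multi-task learning).
        self.fake_head = nn.Linear(shared_dim * 2, 2)               # bona fide vs. spoof
        self.emotion_head = nn.Linear(shared_dim * 2, n_emotions)   # auxiliary emotion task

    def forward(self, asr_feat, emo_feat):
        a = F.normalize(self.asr_proj(asr_feat), dim=-1)
        e = F.normalize(self.emo_proj(emo_feat), dim=-1)
        fused = torch.cat([a, e], dim=-1)
        return self.fake_head(fused), self.emotion_head(fused), a, e

def contrastive_alignment(a, e, temperature=0.1):
    # InfoNCE-style loss pulling each utterance's acoustic and emotion views together
    # while pushing apart views from different utterances in the batch.
    logits = a @ e.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Joint objective: detection + auxiliary emotion recognition + alignment.
model = EmotionBridgedDetector()
asr_feat = torch.randn(4, 768)   # stand-in for ASR encoder outputs
emo_feat = torch.randn(4, 256)   # stand-in for emotion embeddings
fake_logits, emo_logits, a, e = model(asr_feat, emo_feat)
y_fake = torch.randint(0, 2, (4,))
y_emo = torch.randint(0, 8, (4,))
loss = (F.cross_entropy(fake_logits, y_fake)
        + F.cross_entropy(emo_logits, y_emo)
        + contrastive_alignment(a, e))
loss.backward()
```

Summing the three losses with equal weights is a simplification; a real system would likely tune per-task weights.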

📝 Abstract
Speech deepfake detection has been widely explored using low-level acoustic descriptors. However, each study tends to select a different feature set, making it difficult to establish a unified representation for the task. Moreover, such features are not intuitive for humans to perceive, as the distinction between bona fide and synthesized speech becomes increasingly subtle with the advancement of deepfake generation techniques. Emotion, on the other hand, remains a uniquely human attribute that current deepfake generators struggle to fully replicate, reflecting the gap toward true artificial general intelligence. Interestingly, many existing acoustic and semantic features are implicitly correlated with emotion. For instance, the speech features recognized by automatic speech recognition systems often vary naturally with emotional expression. Based on this insight, we propose a novel training framework that leverages emotion as a bridge between conventional deepfake features and emotion-oriented representations. Experiments on the widely used FakeOrReal and In-the-Wild datasets demonstrate consistent and substantial improvements: accuracy increases of up to approximately 6% and 2%, respectively, and equal error rate (EER) reductions of up to about 4% and 1%, respectively, with comparable results on ASVspoof2019. This approach provides a unified training strategy for all features and an interpretable feature direction for deepfake detection, while improving model performance through emotion-informed learning.
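The abstract reports equal error rate (EER), the operating point where the false-acceptance rate and false-rejection rate coincide. Below is a minimal, self-contained sketch of how EER is commonly computed from detection scores; the score convention (higher score means more likely bona fide) is an assumption, and the paper's exact evaluation protocol may differ.

```python
# Minimal sketch of equal-error-rate (EER) computation, the metric reported above.
import numpy as np

def compute_eer(scores, labels):
    # scores: higher = more likely bona fide; labels: 1 = bona fide, 0 = spoof.
    thresholds = np.sort(np.unique(scores))
    fars, frrs = [], []
    for t in thresholds:
        decisions = scores >= t
        far = np.mean(decisions[labels == 0])    # spoofed speech accepted as bona fide
        frr = np.mean(~decisions[labels == 1])   # bona fide speech rejected
        fars.append(far)
        frrs.append(frr)
    fars, frrs = np.array(fars), np.array(frrs)
    idx = np.argmin(np.abs(fars - frrs))         # threshold where FAR is closest to FRR
    return (fars[idx] + frrs[idx]) / 2

scores = np.array([0.9, 0.8, 0.3, 0.1, 0.7, 0.2])
labels = np.array([1, 1, 0, 0, 1, 0])
print(compute_eer(scores, labels))  # 0.0 for this perfectly separable toy example
```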
Problem

Research questions and friction points this paper is trying to address.

Fragmented, study-specific acoustic feature sets hinder a unified representation for speech deepfake detection
Distinctions between bona fide and synthesized speech are becoming too subtle for low-level features and human perception
Can emotion, which current generators struggle to replicate, serve as a bridge that improves detection accuracy?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages emotion as a bridge between conventional deepfake features and emotion-oriented representations
Provides a unified training strategy across all feature types
Improves accuracy and interpretability through emotion-informed learning