Deepfake Audio Detection Using Self-supervised Fusion Representations

📅 2026-05-05
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses component-level deepfake audio detection, where speech and background sounds may be manipulated independently. The authors propose a dual-branch framework that uses the pretrained models XLS-R and BEATs to extract speech and environmental-sound representations, respectively. A Matching Head models the differences between the two representations through statistical normalization and representation interaction, while multi-head cross-attention exchanges information between the speech and environmental components. The refined features pass through residual connections and layer normalization and are then fed to an AASIST classifier, and the system outputs original, speech, and environment predictions. On the test set, the method achieves an F1-score of 70.20% and an environmental equal error rate (EER) of 16.54%, outperforming the baseline system.
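For readers unfamiliar with the EER metric quoted above, the following is a minimal sketch of how an equal error rate can be computed from two lists of scores. It uses the generic definition (the operating point where the false-acceptance and false-rejection rates meet) and is not the challenge's official scoring script. The function name `equal_error_rate`, the score convention (higher means more likely bona fide), and the toy scores are assumptions made for illustration.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Return (eer, threshold) for the generic EER definition.

    Assumption: a higher score means "more likely bona fide".
    """
    # Candidate thresholds: every distinct observed score.
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # False rejection: a bona fide sample scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: a spoofed sample scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores, invented for illustration only.
eer, threshold = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
print(f"EER = {eer:.2%} at threshold {threshold}")
```

With a finite score set the two error rates rarely cross exactly, so this sketch reports their average at the closest threshold. Evaluation toolkits often interpolate instead, which can give slightly different values.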
📝 Abstract
This paper describes a submission to the Environment-Aware Speech and Sound Deepfake Detection Challenge (ESDD2) 2026, which addresses component-level deepfake detection using the CompSpoofV2 dataset, where speech and environmental sounds may be independently manipulated. To address this challenge, a dual-branch deepfake detection framework is proposed to jointly model speech and environmental contextual representations from input audio. Two pretrained models, XLS-R for speech and BEATs for environmental sound, are used to extract complementary contextual representations. A Matching Head is introduced to model representation differences through statistical normalization and representation interaction, enabling estimation of the original class. In parallel, multi-head cross-attention enables effective information exchange between speech and environmental components. The refined representations are processed with residual connections and layer normalization, and passed to an AASIST classifier to predict speech-based and environment-based spoofing probabilities. The model outputs original, speech, and environment predictions. On the test set, the proposed system achieves an F1-score of 70.20% and an environmental EER of 16.54%, outperforming the baseline system.
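The cross-attention step described in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: it uses a single attention head without learned projections (the paper uses multi-head cross-attention), and it assumes both branches have already been projected to the same feature dimension. The toy frame vectors stand in for XLS-R and BEATs outputs, and all names here (`cross_attend`, `layer_norm`, `speech`, `env`) are hypothetical.

```python
import math


def softmax(row):
    # Subtract the maximum for numerical stability.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(vec, eps=1e-5):
    # Normalize one frame to zero mean and (approximately) unit variance.
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def cross_attend(queries, keys_values):
    """Single-head cross-attention: each query frame attends over the other branch."""
    d = len(queries[0])
    refined = []
    for q in queries:
        # Scaled dot-product score against every frame of the other branch.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys_values]
        weights = softmax(scores)
        # Attention-weighted sum of the other branch's frames.
        context = [sum(w * kv[j] for w, kv in zip(weights, keys_values)) for j in range(d)]
        # Residual connection followed by layer normalization.
        refined.append(layer_norm([x + c for x, c in zip(q, context)]))
    return refined


# Toy stand-ins for frame-level features (3 speech frames, 2 environment frames).
speech = [[1.0, 0.0, 0.5, -0.5], [0.2, 0.8, -0.3, 0.1], [0.0, 0.1, 0.9, 0.4]]
env = [[0.3, -0.2, 0.7, 0.0], [0.9, 0.4, -0.1, 0.2]]

speech_refined = cross_attend(speech, env)  # speech frames query the environment
env_refined = cross_attend(env, speech)     # environment frames query the speech
```

Each branch keeps its own number of frames after the exchange, so the two refined sequences can be passed on separately. How they are combined before the AASIST classifier is not specified in the abstract and is left out here.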
Problem

Research questions and friction points this paper is trying to address.

Deepfake Audio Detection
Component-level Spoofing
Speech Manipulation
Environmental Sound Manipulation
Audio Forensics
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-supervised fusion
dual-branch framework
cross-attention
environment-aware deepfake detection
matching head
🔎 Similar Papers
2024-04-22, arXiv.org, Citations: 25
Khalid Zaman
Graduate School of Advanced Science and Technology, Japan Advanced Institute of Science and Technology
Qixuan Huang
Graduate School of Advanced Science and Technology, Japan Advanced Institute of Science and Technology
Muhammad Uzair
Graduate School of Advanced Science and Technology, Japan Advanced Institute of Science and Technology
Masashi Unoki
JAIST
Auditory model, speech signal processing