🤖 AI Summary
To address the limited generalization of audio deepfake detection models, this paper systematically investigates the joint optimization of pretrained backbone networks, data augmentation strategies, and loss functions. We propose lightweight, efficient detection architectures built upon Wav2Vec2, WavLM, and Whisper, combined with joint time-domain and frequency-domain augmentations and a contrastive-learning-driven loss design. Cross-scenario evaluations on ASVspoof 2019–2021 and a private dataset demonstrate that our approach significantly improves robustness against unseen attack types and recording conditions; a single proposed model outperforms the top-ranked single system of the ASVspoof 5 Challenge. This work establishes a reproducible, generalization-enhanced paradigm for audio deepfake detection and provides a standardized experimental benchmark framework.
📝 Abstract
In this paper, we present a comprehensive study aimed at enhancing the generalization capabilities of audio deepfake detection models. We investigate the performance of various pre-trained backbones, including Wav2Vec2, WavLM, and Whisper, across a diverse set of datasets spanning the ASVspoof challenges and additional sources. Our experiments focus on how different data augmentation strategies and loss functions affect model performance. Our results demonstrate substantial improvements in generalization, surpassing the performance of the top-ranked single system in the ASVspoof 5 Challenge. This study contributes valuable insights into optimizing audio models for more robust deepfake detection and facilitates future research in this critical area.
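As a rough illustration of the joint time-domain and frequency-domain augmentation mentioned above, the sketch below combines additive noise on the waveform with SpecAugment-style frequency masking on a magnitude spectrogram. The function names, parameters (`snr_db`, `max_mask_bins`, `n_fft`, `hop`), and NumPy-only STFT are hypothetical choices for illustration, not the paper's actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

def time_domain_augment(wave, snr_db=20.0):
    # Time-domain augmentation: add white noise at a target SNR.
    # (snr_db is an illustrative parameter, not from the paper.)
    signal_power = np.mean(wave ** 2) + 1e-12
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), wave.shape)
    return wave + noise

def freq_domain_augment(spec, max_mask_bins=8):
    # Frequency-domain augmentation: zero out a random band of
    # frequency bins, in the spirit of SpecAugment.
    spec = spec.copy()
    n_bins = spec.shape[0]
    width = int(rng.integers(1, max_mask_bins + 1))
    start = int(rng.integers(0, max(1, n_bins - width)))
    spec[start:start + width, :] = 0.0
    return spec

def joint_augment(wave, n_fft=512, hop=128):
    # Joint augmentation: perturb the waveform first, then mask
    # the resulting magnitude spectrogram.
    noisy = time_domain_augment(wave)
    # Minimal STFT magnitude via framing + windowed FFT.
    frames = np.lib.stride_tricks.sliding_window_view(noisy, n_fft)[::hop]
    spec = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)).T
    return freq_domain_augment(spec)
```

In practice such augmentations would feed the spectrogram (or raw waveform) into a pretrained backbone such as Wav2Vec2, WavLM, or Whisper during fine-tuning.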