🤖 AI Summary
This work addresses the lack of systematic evaluation of self-supervised speech models for audio deepfake detection—a security-critical task—by introducing the Spoof-SUPERB benchmark. For the first time, this benchmark integrates spoofing detection into the standardized SUPERB-style evaluation framework, enabling a comprehensive assessment of 20 prominent self-supervised models spanning generative, discriminative, and spectrogram-based architectures, including XLS-R, UniSpeech-SAT, and WavLM Large. The study evaluates model robustness under in-domain, out-of-domain, and acoustically degraded conditions. Results demonstrate that large-scale discriminative models consistently outperform other architectures across multilingual and multi-perturbation scenarios, offering reliable guidance for selecting effective representations in voice security applications.
📝 Abstract
Self-supervised learning (SSL) has transformed speech processing, with benchmarks such as SUPERB establishing fair comparisons across diverse downstream tasks. Despite its security-critical importance, audio deepfake detection has remained outside these efforts. In this work, we introduce Spoof-SUPERB, a benchmark for audio deepfake detection that systematically evaluates 20 SSL models spanning generative, discriminative, and spectrogram-based architectures. We evaluate these models on multiple in-domain and out-of-domain datasets. Our results reveal that large-scale discriminative models such as XLS-R, UniSpeech-SAT, and WavLM Large consistently outperform other models, benefiting from multilingual pretraining, speaker-aware objectives, and model scale. We further analyze the robustness of these models under acoustic degradations, showing that generative approaches degrade sharply, while discriminative models remain resilient. This benchmark establishes a reproducible baseline and provides practical insights into which SSL representations are most reliable for securing speech systems against audio deepfakes.
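To illustrate the SUPERB-style evaluation recipe the abstract refers to (a frozen SSL upstream feeding a lightweight downstream classifier), below is a minimal, hedged sketch using Hugging Face `transformers` with WavLM Large, one of the evaluated models. The model checkpoint, pooling strategy, and linear head here are illustrative assumptions, not the paper's exact setup; SUPERB-style benchmarks typically learn a weighted sum over all hidden layers rather than using only the last layer.

```python
import torch
from transformers import AutoFeatureExtractor, WavLMModel

# Frozen SSL upstream (WavLM Large is one of the evaluated models; checkpoint
# choice here is an assumption for illustration).
extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-large")
upstream = WavLMModel.from_pretrained("microsoft/wavlm-large")
upstream.eval()
for p in upstream.parameters():
    p.requires_grad = False  # SUPERB-style: upstream stays frozen

# Lightweight downstream head: mean-pooled frame features -> bonafide/spoof logits.
# (SUPERB benchmarks often use a learned weighted sum over layers instead.)
head = torch.nn.Linear(upstream.config.hidden_size, 2)

waveform = torch.randn(16000)  # placeholder: 1 second of 16 kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    hidden = upstream(**inputs).last_hidden_state  # (batch, frames, hidden)

logits = head(hidden.mean(dim=1))  # (batch, 2): bonafide vs. spoof
```

In this setup only the small head is trained per task, so differences in detection performance can be attributed to the quality of the frozen SSL representations themselves.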