🤖 AI Summary
This work addresses the lack of systematic evaluation of self-supervised speech models for audio deepfake detection—a security-critical task—by introducing the Spoof-SUPERB benchmark. For the first time, this benchmark integrates spoofing detection into the standardized SUPERB-style evaluation framework, enabling a comprehensive assessment of 20 prominent self-supervised models spanning generative, discriminative, and spectrogram-based architectures, including XLS-R, UniSpeech-SAT, and WavLM Large. The study evaluates model robustness under in-domain, out-of-domain, and acoustically degraded conditions. Results demonstrate that large-scale discriminative models consistently outperform other architectures across multilingual and multi-perturbation scenarios, offering reliable guidance for selecting effective representations in voice security applications.
📝 Abstract
Self-supervised learning (SSL) has transformed speech processing, with benchmarks such as SUPERB establishing fair comparisons across diverse downstream tasks. Despite its security-critical importance, audio deepfake detection has remained outside these efforts. In this work, we introduce Spoof-SUPERB, a benchmark for audio deepfake detection that systematically evaluates 20 SSL models spanning generative, discriminative, and spectrogram-based architectures. We evaluate these models on multiple in-domain and out-of-domain datasets. Our results reveal that large-scale discriminative models such as XLS-R, UniSpeech-SAT, and WavLM Large consistently outperform other models, benefiting from multilingual pretraining, speaker-aware objectives, and model scale. We further analyze the robustness of these models under acoustic degradations, showing that generative approaches degrade sharply, while discriminative models remain resilient. This benchmark establishes a reproducible baseline and provides practical insights into which SSL representations are most reliable for securing speech systems against audio deepfakes.
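To illustrate the SUPERB-style evaluation recipe the abstract refers to (a frozen SSL upstream feeding a lightweight downstream classifier), below is a minimal, hedged sketch using Hugging Face `transformers` with WavLM Large, one of the evaluated models. The model checkpoint, pooling strategy, and linear head here are illustrative assumptions, not the paper's exact setup; SUPERB-style benchmarks typically learn a weighted sum over all hidden layers rather than using only the last layer.

```python
import torch
from transformers import AutoFeatureExtractor, WavLMModel

# Frozen SSL upstream (WavLM Large is one of the evaluated models; checkpoint
# choice here is an assumption for illustration).
extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-large")
upstream = WavLMModel.from_pretrained("microsoft/wavlm-large")
upstream.eval()
for p in upstream.parameters():
    p.requires_grad = False  # SUPERB-style: upstream stays frozen

# Lightweight downstream head: mean-pooled frame features -> bonafide/spoof logits.
# (SUPERB benchmarks often use a learned weighted sum over layers instead.)
head = torch.nn.Linear(upstream.config.hidden_size, 2)

waveform = torch.randn(16000)  # placeholder: 1 second of 16 kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    hidden = upstream(**inputs).last_hidden_state  # (batch, frames, hidden)

logits = head(hidden.mean(dim=1))  # (batch, 2): bonafide vs. spoof
```

In this setup only the small head is trained per task, so differences in detection performance can be attributed to the quality of the frozen SSL representations themselves.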