Do Compact SSL Backbones Matter for Audio Deepfake Detection? A Controlled Study with RAPTOR

📅 2026-03-06
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the overreliance of current audio deepfake detection methods on large self-supervised learning (SSL) backbones and the lack of systematic evaluation of compact models in cross-domain scenarios. The authors propose a unified pairwise gated fusion detection framework and conduct controlled experiments using compact HuBERT and WavLM SSL backbones across 14 cross-domain benchmarks, revealing that multilingual pretraining is crucial for enhancing cross-domain robustness. They further introduce a perturbation-based test-time augmentation strategy coupled with an aleatoric uncertainty calibration protocol, which uncovers model calibration discrepancies invisible to standard metrics. Experimental results demonstrate that a multilingual HuBERT variant with only 100M parameters matches the cross-domain detection performance of substantially larger models and commercial systems, while iterative mHuBERT exhibits markedly superior calibration stability compared to WavLM variants.

πŸ“ Abstract
Self-supervised learning (SSL) underpins modern audio deepfake detection, yet most prior work centers on a single large wav2vec2-XLSR backbone, leaving compact backbones understudied. We present RAPTOR (Representation-Aware Pairwise-gated Transformer for Out-of-domain Recognition), a controlled study of compact SSL backbones from the HuBERT and WavLM families within a unified pairwise-gated fusion detector, evaluated across 14 cross-domain benchmarks. We show that multilingual HuBERT pre-training is the primary driver of cross-domain robustness, enabling 100M-parameter models to match larger and commercial systems. Beyond EER, we introduce a test-time augmentation protocol with perturbation-based aleatoric uncertainty to expose calibration differences invisible to standard metrics: WavLM variants exhibit overconfident miscalibration under perturbation, whereas iterative mHuBERT remains stable. These findings indicate that SSL pre-training trajectory, not model scale, drives reliable audio deepfake detection.
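The perturbation-based test-time augmentation (TTA) protocol described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's exact method: the detector interface, additive Gaussian noise as the perturbation, and score variance as the aleatoric-uncertainty proxy are all assumptions made for the sketch.

```python
import numpy as np

def tta_uncertainty(detector, audio, n_aug=8, noise_std=0.005, seed=0):
    """Score an audio clip under test-time augmentation.

    Runs `detector` on several noise-perturbed copies of `audio` and
    returns the mean score together with the score variance, used here
    as a simple uncertainty proxy. A well-calibrated detector should
    keep both stable under small perturbations.
    """
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_aug):
        # Hypothetical perturbation model: small additive Gaussian noise.
        perturbed = audio + rng.normal(0.0, noise_std, size=audio.shape)
        scores.append(detector(perturbed))
    scores = np.asarray(scores, dtype=float)
    return scores.mean(), scores.var()

# Toy stand-in detector (hypothetical): score = mean absolute amplitude.
toy_detector = lambda x: float(np.abs(x).mean())

audio = np.zeros(16000)  # 1 s of "silence" at 16 kHz
mean_score, uncertainty = tta_uncertainty(toy_detector, audio)
```

In this framing, a model that is overconfident under perturbation would show large score swings (high variance) while still emitting extreme per-copy scores, which is the kind of calibration gap the abstract says standard EER-style metrics miss.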
Problem

Research questions and friction points this paper is trying to address.

audio deepfake detection
self-supervised learning
compact backbones
cross-domain robustness
model calibration
Innovation

Methods, ideas, or system contributions that make the work stand out.

compact SSL backbones
audio deepfake detection
cross-domain robustness
test-time augmentation
aleatoric uncertainty