AI Summary
This study addresses the lack of in-depth analysis of the core mechanisms of supervised contrastive learning (SupCon) in deepfake audio detection, particularly regarding similarity metrics and negative sampling strategies. Building upon the wav2vec2 XLS-R (300M) backbone, the authors propose a two-stage training pipeline and conduct the first controlled ablation study of SupCon for this task, systematically evaluating cosine versus angular similarity and a warmed-up global cross-batch negative queue. Results demonstrate that angular similarity reduces reliance on large negative batches, while cosine similarity combined with a delayed queue substantially enhances generalization. Trained on ASVspoof 2019 LA, the model achieves state-of-the-art performance across multiple evaluation datasets, with an ITW EER of 8.29% and a pooled EER of 4.44%, confirming the method's effectiveness and robustness.
Abstract
Supervised contrastive learning (SupCon) is widely used to shape representations, but it has seen limited targeted study for audio deepfake detection. Existing work typically embeds contrastive terms in broader pipelines, so a focused analysis of SupCon itself is missing. In this work, we run a controlled study on wav2vec2 XLS-R (300M) that varies (i) the similarity used in SupCon (cosine vs. angular similarity derived from the hyperspherical angle) and (ii) negative scaling using a warm-started global cross-batch queue. Stage 1 fine-tunes the encoder and projection head with SupCon; Stage 2 freezes them and trains a linear classifier with BCE. Trained on ASVspoof 2019 LA and evaluated on ASV19 eval plus ITW and ASVspoof 2021 DF/LA, cosine SupCon with a delayed queue achieves the best ITW EER (8.29%) and pooled EER (4.44%), while angular similarity performs strongly without queued negatives (ITW EER 8.70%), indicating reduced reliance on large negative sets.
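To make the two studied factors concrete, the following is a minimal, dependency-free sketch of a SupCon loss with a pluggable similarity and an optional delayed negative queue. It is not the authors' implementation. The angular form `1 - 2*theta/pi`, the temperature of 0.1, the choice to keep filling the queue during warm-up, and the names `angular_sim`, `supcon_loss`, and `DelayedQueue` are all assumptions made for illustration; the abstract does not specify them.

```python
import math
from collections import deque


def l2_normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_sim(u, v):
    """Cosine similarity of two unit-norm vectors (a plain dot product)."""
    return sum(a * b for a, b in zip(u, v))


def angular_sim(u, v):
    """Assumed angular similarity: 1 - 2*theta/pi, with theta the angle between u and v."""
    cos = max(-1.0, min(1.0, cosine_sim(u, v)))  # clamp against rounding error
    return 1.0 - 2.0 * math.acos(cos) / math.pi


def supcon_loss(embs, labels, sim=cosine_sim, temperature=0.1, queue=()):
    """Supervised contrastive loss averaged over anchors that have a positive.

    `queue` holds (unit_embedding, label) pairs from earlier batches. As an
    assumption, only entries whose label differs from the anchor's are used,
    and only as extra negatives in the denominator.
    """
    z = [l2_normalize(e) for e in embs]
    total, n_anchors = 0.0, 0
    for i, (zi, yi) in enumerate(zip(z, labels)):
        others = [j for j in range(len(z)) if j != i]
        positives = [j for j in others if labels[j] == yi]
        if not positives:
            continue  # an anchor without in-batch positives contributes nothing
        logits = {j: sim(zi, z[j]) / temperature for j in others}
        denom = list(logits.values())
        denom += [sim(zi, q) / temperature for q, yq in queue if yq != yi]
        m = max(denom)  # log-sum-exp with max subtraction for stability
        log_denom = m + math.log(sum(math.exp(t - m) for t in denom))
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        n_anchors += 1
    return total / max(n_anchors, 1)


class DelayedQueue:
    """FIFO store of past embeddings that is only exposed after a warm-up period."""

    def __init__(self, max_size, warmup_steps):
        self.items = deque(maxlen=max_size)
        self.warmup_steps = warmup_steps
        self.step = 0

    def negatives(self):
        # Before warm-up ends, the loss sees in-batch samples only.
        return list(self.items) if self.step >= self.warmup_steps else []

    def push(self, embs, labels):
        for e, y in zip(embs, labels):
            self.items.append((l2_normalize(e), y))
        self.step += 1


# Toy batch: two "bona fide" (0) and two "spoof" (1) embeddings.
embs = [[1.0, 0.1], [0.9, 0.2], [-1.0, 0.1], [-0.8, -0.3]]
labels = [0, 0, 1, 1]

loss_cos = supcon_loss(embs, labels, sim=cosine_sim)
loss_ang = supcon_loss(embs, labels, sim=angular_sim)

q = DelayedQueue(max_size=4, warmup_steps=2)
q.push(embs, labels)
negs_during_warmup = q.negatives()
q.push(embs, labels)
loss_cos_queue = supcon_loss(embs, labels, sim=cosine_sim, queue=q.negatives())
```

In a real training loop the queued embeddings would typically be detached from the computation graph, and the Stage 2 linear classifier would be trained separately on the frozen encoder output; both are outside the scope of this sketch.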