Similarity Choice and Negative Scaling in Supervised Contrastive Learning for Deepfake Audio Detection

📅 2026-04-28
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study addresses the lack of targeted analysis of supervised contrastive learning (SupCon) in deepfake audio detection, focusing on two design choices: the similarity function and the scaling of negatives. Using a wav2vec2 XLS-R (300M) backbone and a two-stage training pipeline, the authors run a controlled study that compares cosine with angular similarity and tests a warm-started global cross-batch negative queue. The results indicate that angular similarity performs strongly without queued negatives (ITW EER 8.70%), suggesting reduced reliance on large negative sets, while cosine similarity combined with a delayed queue gives the best generalization. Trained on ASVspoof 2019 LA and evaluated on the ASVspoof 2019 eval set, ITW, and ASVspoof 2021 DF/LA, cosine SupCon with a delayed queue achieves the best ITW EER (8.29%) and the best pooled EER (4.44%).
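The figures above are equal error rates (EER): the operating point at which the rate of spoofed clips accepted equals the rate of bona fide clips rejected. The following is a minimal sketch of that metric, not the paper's evaluation code. The function name `equal_error_rate`, the score convention (higher means more likely bona fide), and the label convention (1 = bona fide, 0 = spoof) are assumptions made for illustration, and the threshold sweep is a simple approximation without interpolation.

```python
def equal_error_rate(scores, labels):
    """Approximate EER by sweeping thresholds over the observed scores.

    Assumed conventions (not taken from the paper):
    higher score = more likely bona fide; label 1 = bona fide, 0 = spoof.
    """
    bona = [s for s, l in zip(scores, labels) if l == 1]
    spoof = [s for s, l in zip(scores, labels) if l == 0]
    if not bona or not spoof:
        raise ValueError("need at least one bona fide and one spoof score")

    best_gap, eer = float("inf"), None
    # A clip is accepted as bona fide when its score is >= threshold.
    for t in sorted(set(scores)) + [float("inf")]:
        frr = sum(s < t for s in bona) / len(bona)      # bona fide rejected
        far = sum(s >= t for s in spoof) / len(spoof)   # spoof accepted
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy usage: one bona fide clip scores below one spoof clip.
toy_scores = [0.9, 0.4, 0.6, 0.1]
toy_labels = [1, 1, 0, 0]
toy_eer = equal_error_rate(toy_scores, toy_labels)
```

Production evaluations usually interpolate between thresholds on the ROC curve; the sweep here only picks the closest observed crossing.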
πŸ“ Abstract
Supervised contrastive learning (SupCon) is widely used to shape representations, but has seen limited targeted study for audio deepfake detection. Existing work typically combines contrastive terms with broader pipelines, so a focused analysis of SupCon itself is missing. In this work, we run a controlled study on wav2vec2 XLS-R (300M) that varies (i) the similarity used in SupCon (cosine vs. angular similarity derived from the hyperspherical angle) and (ii) negative scaling using a warm-started global cross-batch queue. Stage 1 fine-tunes the encoder and projection head with SupCon; Stage 2 freezes them and trains a linear classifier with BCE. Trained on ASVspoof 2019 LA and evaluated on ASV19 eval plus ITW and ASVspoof 2021 DF/LA, cosine SupCon with a delayed queue achieves the best ITW EER (8.29%) and pooled EER (4.44%), while angular similarity performs strongly without queued negatives (ITW EER 8.70%), indicating reduced reliance on large negative sets.
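The two factors varied in the abstract can be illustrated with a small NumPy sketch of a SupCon loss with a switchable similarity and optional queued negatives. This is not the authors' implementation. The abstract does not give the exact angular formula, so the linear mapping `1 - 2*theta/pi` is an assumption, as are the negatives-only treatment of the queue, the temperature `tau`, the `warmup_steps` delay, and every function and variable name below.

```python
import numpy as np


def l2_normalize(x):
    """Project each row onto the unit hypersphere."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def pairwise_similarity(a, b, kind="cosine"):
    """Similarity between rows of unit-norm a and unit-norm b."""
    cos = np.clip(a @ b.T, -1.0, 1.0)
    if kind == "cosine":
        return cos
    if kind == "angular":
        # ASSUMPTION: map the angle theta in [0, pi] linearly onto [1, -1].
        return 1.0 - 2.0 * np.arccos(cos) / np.pi
    raise ValueError(f"unknown similarity: {kind}")


def supcon_loss(z, y, queue_z=None, queue_y=None, kind="cosine", tau=0.1):
    """SupCon loss over a batch; assumes some anchor has an in-batch positive."""
    z = l2_normalize(np.asarray(z, dtype=float))
    y = np.asarray(y)
    n = len(z)
    not_self = ~np.eye(n, dtype=bool)
    logits = pairwise_similarity(z, z, kind) / tau
    keep = not_self                                   # drop self-pairs
    positives = (y[:, None] == y[None, :]) & not_self
    if queue_z is not None and len(queue_z) > 0:
        q = l2_normalize(np.asarray(queue_z, dtype=float))
        q_logits = pairwise_similarity(z, q, kind) / tau
        # ASSUMPTION: queued items act only as negatives, so same-label
        # queue entries are masked out rather than used as positives.
        q_keep = y[:, None] != np.asarray(queue_y)[None, :]
        logits = np.hstack([logits, q_logits])
        keep = np.hstack([keep, q_keep])
        positives = np.hstack([positives, np.zeros_like(q_keep)])
    logits = np.where(keep, logits, -np.inf)
    # Numerically stable log-softmax over all kept keys.
    m = logits.max(axis=1, keepdims=True)
    log_prob = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    n_pos = positives.sum(axis=1)
    has_pos = n_pos > 0
    pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    return float(np.mean(-pos_log_prob[has_pos] / n_pos[has_pos]))


# Toy usage with random embeddings; the queue is enabled only after a
# hypothetical warm-up period ("delayed queue").
rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
queue_z = rng.normal(size=(32, 16))
queue_y = rng.integers(0, 2, size=32)
warmup_steps, step = 100, 150
use_queue = step >= warmup_steps
loss = supcon_loss(z, y,
                   queue_z if use_queue else None,
                   queue_y if use_queue else None,
                   kind="cosine")
```

Under these assumptions the queue only enlarges the softmax denominator, so for the same batch the loss with queued negatives is never lower than the loss without them. In Stage 2 of the pipeline described above, the encoder and projection head would be frozen and only a linear classifier trained with BCE on top.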
Problem

Research questions and friction points this paper is trying to address.

Supervised Contrastive Learning
Deepfake Audio Detection
Similarity Choice
Negative Scaling
Audio Representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

supervised contrastive learning
similarity choice
negative scaling
deepfake audio detection
cross-batch queue