SONAR: Spectral-Contrastive Audio Residuals for Generalizable Deepfake Detection

📅 2025-11-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deepfake audio detectors suffer from poor out-of-distribution generalization, primarily due to spectral bias—models over-rely on low-frequency semantic cues while neglecting high-frequency artifacts. To address this, we propose SONAR, the first unsupervised, architecture-agnostic frequency-domain disentanglement framework that explicitly treats high-frequency residuals as the core learning signal. SONAR employs a learnable high-pass filter (SRM), frequency-cross attention, and a frequency-aware Jensen–Shannon contrastive loss to jointly model and disentangle low-frequency semantic manifolds from high-frequency distortion manifolds. Evaluated on ASVspoof 2021 and real-world datasets, SONAR achieves state-of-the-art detection performance, converges four times faster than baseline methods, and significantly improves cross-distribution robustness and generalization.
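The frequency-aware Jensen–Shannon contrastive idea in the summary can be sketched in pure Python. This is a toy illustration under the assumption of softmax-normalized discrete embeddings, not the paper's exact loss; the variables `real_low`, `real_high`, and `fake` are hypothetical stand-ins for the low-frequency view, the high-frequency residual view, and a spoofed sample.

```python
import math

def kl(p, q):
    # Kullback-Leibler divergence KL(p || q) for discrete distributions.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    # Jensen-Shannon divergence: symmetric and bounded above by log(2).
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def softmax(xs):
    mx = max(xs)
    exps = [math.exp(x - mx) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Toy contrastive objective: pull a genuine content/residual pair together
# (low JS divergence) while pushing a fake embedding apart (high JS divergence).
real_low  = softmax([2.0, 0.5, 0.1])   # low-frequency view of genuine audio
real_high = softmax([1.9, 0.6, 0.2])   # matching high-frequency residual view
fake      = softmax([0.1, 0.4, 2.5])   # spoofed audio with distorted residuals

pull = js_divergence(real_low, real_high)  # small: views of the same real clip agree
push = js_divergence(real_low, fake)       # large: fake residuals live elsewhere
loss = pull - push  # minimized when real pairs align and fakes separate
```

Minimizing such a loss moves genuine pairs onto one manifold and fakes onto another, which is the "sharpened decision boundary" effect the summary describes.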

📝 Abstract
Deepfake (DF) audio detectors still struggle to generalize to out-of-distribution inputs. A central reason is spectral bias, the tendency of neural networks to learn low-frequency structure before high-frequency (HF) details, which both causes DF generators to leave HF artifacts and leaves those same artifacts under-exploited by common detectors. To address this gap, we propose Spectral-cONtrastive Audio Residuals (SONAR), a frequency-guided framework that explicitly disentangles an audio signal into complementary representations. An XLSR encoder captures the dominant low-frequency content, while a cloned copy of the same path, preceded by learnable, value-constrained SRM high-pass filters, distills faint HF residuals. Frequency cross-attention reunites the two views to capture long- and short-range frequency dependencies, and a frequency-aware Jensen–Shannon contrastive loss pulls real content–noise pairs together while pushing fake embeddings apart, accelerating optimization and sharpening decision boundaries. Evaluated on the ASVspoof 2021 and in-the-wild benchmarks, SONAR attains state-of-the-art performance and converges four times faster than strong baselines. By elevating faint high-frequency residuals to first-class learning signals, SONAR offers a fully data-driven, frequency-guided contrastive framework that splits the latent space into two disjoint manifolds: natural-HF for genuine audio and distorted-HF for synthetic audio. Because the scheme operates purely at the representation level, it is architecture-agnostic and can, in future work, be seamlessly integrated into any model or modality where subtle high-frequency cues are decisive.
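The value-constrained high-pass filtering described in the abstract can be illustrated with a minimal 1-D sketch. The paper's SRM filters are learnable kernels inside the network; the zero-sum projection below is an assumed simplification that only demonstrates why such a constraint blocks low-frequency (DC-like) content and passes residual detail.

```python
def constrain_highpass(kernel):
    # Project a 1-D kernel onto the zero-sum constraint so it rejects the
    # mean (the lowest frequency) and responds only to residual variation.
    mean = sum(kernel) / len(kernel)
    return [w - mean for w in kernel]

def convolve_same(signal, kernel):
    # 'same'-length 1-D filtering with zero padding (correlation form).
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [sum(kernel[j] * padded[i + j] for j in range(k))
            for i in range(len(signal))]

# Hypothetical learnable weights, re-projected onto the constraint each step.
kernel = constrain_highpass([0.2, 0.5, 0.3])

constant = [1.0] * 8                     # pure low-frequency (flat) signal
residual = convolve_same(constant, kernel)
# Interior outputs are ~0: the constrained filter suppresses smooth content,
# so only genuine high-frequency structure survives into the residual branch.
```

In a trained model the kernel values would be updated by gradient descent, with the zero-sum projection reapplied so the branch never collapses back onto low-frequency semantics.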
Problem

Research questions and friction points this paper is trying to address.

Deepfake audio detectors struggle to generalize to out-of-distribution inputs
Spectral bias leads networks to over-rely on low-frequency cues and miss high-frequency artifacts
Generalizable detection requires cleanly separating natural from synthetic high-frequency signatures
Innovation

Methods, ideas, or system contributions that make the work stand out.

Disentangles audio into complementary low- and high-frequency representations
Uses frequency cross-attention to fuse the two views and model long- and short-range dependencies
Employs a frequency-aware Jensen–Shannon contrastive loss to sharpen decision boundaries
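The cross-attention contribution above can be sketched as scaled dot-product attention in which low-frequency tokens query high-frequency residual tokens; the token values and dimensions below are hypothetical, and the paper's exact attention layout is not reproduced here.

```python
import math

def softmax(xs):
    mx = max(xs)
    exps = [math.exp(x - mx) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(queries, keys, values):
    # Scaled dot-product cross-attention: each low-frequency query token
    # gathers a weighted mix of all high-frequency residual tokens.
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

low  = [[1.0, 0.0], [0.0, 1.0]]           # hypothetical low-frequency tokens
high = [[1.0, 0.1], [0.1, 1.0]]           # hypothetical HF residual tokens
fused = cross_attention(low, high, high)  # one fused token per query
```

Because each fused token is a convex combination of residual tokens, the low-frequency view can selectively attend to the HF artifacts it would otherwise ignore.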