🤖 AI Summary
To address the spread of generative audio attacks and the poor generalization of existing deepfake detectors, this paper proposes a generalizable detection framework built on semantic-agnostic universal audio representations. Rather than relying on semantic-dependent or handcrafted features, it uses the self-supervised speech representation models TRILL and TRILLsson to extract robust non-semantic audio embeddings, which are then fed into a lightweight classifier for spoof detection. Experiments show in-domain performance comparable to state-of-the-art (SOTA) methods and significantly better performance in cross-domain and public benchmark settings, including ASVspoof 2019 Logical Access and In-the-Wild, especially under unknown synthesis algorithms and channel distortions. The core contribution is empirical evidence that semantic-agnostic universal representations play a critical role in the robustness and generalizability of audio forgery detection.
📝 Abstract
Rapid advances in generative modeling have made synthetic audio easy to generate, leaving speech-based services vulnerable to spoofing attacks. Consequently, the need for robust countermeasures is greater than ever. Existing deepfake detection solutions are often criticized for lacking generalizability and fail drastically when applied to real-world data. This study proposes a novel method for generalizable spoofing detection that leverages non-semantic universal audio representations. Extensive experiments were performed to identify suitable non-semantic features using the TRILL and TRILLsson models. The results indicate that the proposed method achieves comparable performance on the in-domain test set while significantly outperforming state-of-the-art approaches on out-of-domain test sets. Notably, it demonstrates superior generalization on public-domain data, surpassing methods based on hand-crafted features, semantic embeddings, and end-to-end architectures.
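The pipeline summarized above, in which embeddings from a frozen pretrained model are passed to a lightweight classifier, can be illustrated with a minimal sketch. This is not the authors' implementation. The `fake_embedding` function is a hypothetical stand-in that generates synthetic frame-level vectors in place of real TRILL or TRILLsson outputs, and the mean pooling over frames, the logistic-regression classifier, the label convention (1 = spoof, 0 = bona fide), and all hyperparameters are assumptions chosen only to keep the example self-contained.

```python
import math
import random


def mean_pool(frames):
    """Average frame-level embeddings (T x D) into one utterance-level vector (D)."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def spoof_score(w, b, x):
    """Probability-like score that utterance embedding x is spoofed."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_logreg(X, y, lr=0.5, epochs=200):
    """Fit logistic regression with full-batch gradient descent."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * dim, 0.0
        for x, t in zip(X, y):
            err = spoof_score(w, b, x) - t  # gradient of log loss w.r.t. the logit
            for d in range(dim):
                gw[d] += err * x[d]
            gb += err
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b


def fake_embedding(rng, label, frames=20, dim=8):
    """HYPOTHETICAL stand-in for a frozen embedding extractor.

    Returns synthetic frames whose mean depends on the label; a real system
    would compute these from audio with a pretrained model instead.
    """
    shift = 1.0 if label == 1 else -1.0
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(frames)]


rng = random.Random(0)
labels = [i % 2 for i in range(40)]  # assumed convention: 1 = spoof, 0 = bona fide
X = [mean_pool(fake_embedding(rng, t)) for t in labels]

w, b = train_logreg(X, labels)
preds = [1 if spoof_score(w, b, x) >= 0.5 else 0 for x in X]
accuracy = sum(p == t for p, t in zip(preds, labels)) / len(labels)
print(f"training accuracy on synthetic embeddings: {accuracy:.2f}")
```

Because the embedding model stays frozen, only the small classifier is trained. The accuracy printed here reflects the synthetic data alone and says nothing about the results reported in the paper.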