AI Summary
Current audio deepfake detectors generalize poorly to unseen synthesis methods (zero-day attacks), and fine-tuning detectors for each new method is too slow for rapid-response requirements. This paper proposes a training-free zero-day detection framework, the first to integrate knowledge retrieval with voiceprint contour matching. It uses pretrained models to extract speech representations, performs similarity search over a large-scale retrieval pool, and fuses multi-granularity voiceprint features for robust matching. Because no model retraining is needed, the approach supports immediate deployment and scalable updates. On the DeepFake-Eval-2024 benchmark, it achieves detection performance comparable to fully fine-tuned models. Ablation studies confirm that retrieval pool size and voiceprint attribute design are critical to accuracy.
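The retrieval step described above can be sketched as a nearest-neighbor lookup over a labeled pool of embeddings. This is a minimal illustration, not the paper's implementation: the function name `retrieve_and_classify` and the use of cosine-similarity k-NN with majority voting are assumptions; in practice the embeddings would come from a pretrained speech foundation model, which is not shown here.

```python
import numpy as np

def retrieve_and_classify(query_emb, pool_embs, pool_labels, k=3):
    """Training-free detection sketch: k-NN over a labeled retrieval pool.

    query_emb   -- embedding of the audio under test (1-D array)
    pool_embs   -- embeddings of pool samples, one per row (2-D array)
    pool_labels -- 0 = bona fide, 1 = deepfake, aligned with pool_embs
    Returns the majority-vote label among the k most similar pool items.
    """
    # Cosine similarity via L2-normalized dot products
    q = query_emb / np.linalg.norm(query_emb)
    p = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    sims = p @ q
    # Indices of the k highest-similarity pool samples
    top_k = np.argsort(sims)[::-1][:k]
    votes = pool_labels[top_k]
    # Majority vote: flag as deepfake if more than half the neighbors are fakes
    return int(votes.sum() * 2 > k)
```

Because the detector is just a lookup, adding samples from a newly observed synthesis method only requires appending their embeddings to the pool, which is what makes the approach immediately deployable without retraining.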
Abstract
Modern audio deepfake detectors built on foundation models and large training datasets achieve promising detection performance. However, they struggle with zero-day attacks, where the audio samples are generated by novel synthesis methods absent from the training data. Conventional defenses against such attacks require fine-tuning the detectors, which is problematic when a prompt response is required. This study introduces a training-free framework for zero-day audio deepfake detection based on knowledge representations, retrieval augmentation, and voice profile matching. Within this framework, we propose simple yet effective knowledge retrieval and ensemble methods that achieve performance comparable to fine-tuned models on DeepFake-Eval-2024, without any additional model training. We also conduct ablation studies on retrieval pool size and voice profile attributes, validating their relevance to system efficacy.
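The ensemble of retrieval and voice-profile matching mentioned in the abstract could be as simple as a weighted score fusion. The sketch below is hypothetical: the function name `ensemble_score`, the mean over per-attribute profile similarities, and the weight values are all illustrative assumptions, not the paper's stated method.

```python
import numpy as np

def ensemble_score(retrieval_sim, profile_sims, weights=(0.6, 0.4)):
    """Fuse a retrieval-based similarity with voice-profile similarities.

    retrieval_sim -- scalar similarity from the retrieval pool lookup
    profile_sims  -- per-attribute voice-profile similarities (multi-granularity
                     voiceprint features, e.g. one score per attribute)
    weights       -- illustrative fusion weights for the two branches
    Returns a single score; higher can be read as more likely bona fide
    (or more likely fake, depending on how the similarities are defined).
    """
    # Collapse the multi-granularity profile scores into one branch score
    profile = float(np.mean(profile_sims))
    w_retrieval, w_profile = weights
    return w_retrieval * retrieval_sim + w_profile * profile
```

A linear fusion like this keeps the system training-free: the weights can be tuned on a small validation set without touching any model parameters.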