🤖 AI Summary
This work addresses the forensic provenance problem for speech deepfakes by proposing the first forensic similarity method for deepfake audio, which determines whether two audio samples originate from the same generative model. The method uses a two-stage deep network: a pretrained deepfake detector first extracts robust forensic features, and a lightweight similarity network then computes a consistency score for the pair. It requires no assumptions about specific forgery artifacts and no training on them, which allows it to generalize to unseen deepfake techniques. On source verification, it significantly outperforms baseline methods, and it extends to applications such as splice detection. Its core contribution is to bring the forensic similarity paradigm to speech deepfakes, removing the reliance of earlier approaches on known artifacts or model priors while remaining robust, adaptable, and practical.
📝 Abstract
In this paper, we introduce a digital audio forensics approach called Forensic Similarity for Speech Deepfakes, which determines whether two audio segments contain the same forensic traces. Our work is inspired by prior work on forensic similarity in the image domain, which demonstrated strong generalization to unknown forensic traces without requiring prior knowledge of them at training time. To achieve this in the audio setting, we propose a two-part deep-learning system composed of a feature extractor, based on a speech deepfake detector backbone, and a shallow neural network referred to as the similarity network. This system maps a pair of audio segments to a score indicating whether they contain the same or different forensic traces. We evaluate the system on the emerging task of source verification, highlighting its ability to identify whether two samples originate from the same generative model. Additionally, we assess its applicability to splicing detection as a complementary use case. Experiments show that the method generalizes to a wide range of forensic traces, including previously unseen ones, illustrating its flexibility and practical value in digital audio forensics.
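To make the data flow of the two-part system concrete, the following is a minimal sketch in plain Python, not the authors' implementation. Every name in it (`toy_extractor`, `SimilarityNetwork`, `forensic_similarity`) is hypothetical. The extractor is a trivial stand-in for the deepfake-detector backbone, the weights are random and untrained, and concatenating the two feature vectors is an assumption, since the abstract does not say how the features are combined.

```python
import math
import random
from typing import Callable, List, Sequence

Vector = List[float]


def toy_extractor(segment: Sequence[float]) -> Vector:
    """Hypothetical stand-in for the deepfake-detector backbone.

    A real system would return a learned embedding; this returns three
    simple waveform statistics so the sketch runs without dependencies.
    """
    n = len(segment)
    mean = sum(segment) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in segment) / n)
    # Mean absolute difference between consecutive samples.
    diff = sum(abs(b - a) for a, b in zip(segment, segment[1:])) / max(n - 1, 1)
    return [mean, std, diff]


class SimilarityNetwork:
    """Shallow network: one hidden ReLU layer followed by a sigmoid output."""

    def __init__(self, feat_dim: int, hidden: int = 8, seed: int = 0) -> None:
        rng = random.Random(seed)
        in_dim = 2 * feat_dim  # assumption: the two feature vectors are concatenated
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)]
                   for _ in range(hidden)]
        self.b1 = [0.0] * hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
        self.b2 = 0.0

    def score(self, f1: Vector, f2: Vector) -> float:
        x = f1 + f2  # list concatenation of the two feature vectors
        h = [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        logit = sum(w * v for w, v in zip(self.w2, h)) + self.b2
        return 1.0 / (1.0 + math.exp(-logit))  # squash to the range (0, 1)


def forensic_similarity(
    seg_a: Sequence[float],
    seg_b: Sequence[float],
    extractor: Callable[[Sequence[float]], Vector],
    net: SimilarityNetwork,
) -> float:
    """Map a pair of audio segments to a single similarity score."""
    return net.score(extractor(seg_a), extractor(seg_b))


if __name__ == "__main__":
    rng = random.Random(1)
    seg_a = [rng.uniform(-1.0, 1.0) for _ in range(160)]
    seg_b = [rng.uniform(-1.0, 1.0) for _ in range(160)]
    net = SimilarityNetwork(feat_dim=3)
    print(forensic_similarity(seg_a, seg_b, toy_extractor, net))
```

Because the weights are untrained, the printed score carries no forensic meaning. The sketch only shows the interface the abstract describes: two segments go in, each passes through the same extractor, and the similarity network returns one score for the pair.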