🤖 AI Summary
This study addresses the limitations of existing deepfake detection research, which predominantly focuses on single-speaker scenarios and struggles to handle novel forgery threats in multi-speaker dialogues. To bridge this gap, the work proposes the first classification framework specifically designed for deepfakes in multi-speaker conversational settings and introduces MsCADD, the first public dataset dedicated to text-to-speech (TTS)-generated two-person dialogue deepfakes. The dataset is constructed using VITS and SoundStorm-based NotebookLM to synthesize realistic dialogue audio. Baseline detection evaluations are conducted with LFCC-LCNN, RawNet2, and Wav2Vec 2.0. Experimental results reveal that current models exhibit limited performance across key metrics including F1 score, accuracy, true positive rate (TPR), and true negative rate (TNR), underscoring the significant challenges posed by this task and establishing a foundation for future research.
📄 Abstract
Rapid advances in text-to-speech (TTS) technologies have made audio deepfakes increasingly realistic and accessible, raising significant security and trust concerns. While existing research has largely focused on detecting single-speaker audio deepfakes, real-world malicious applications in multi-speaker conversational settings are also emerging as a major, underexplored threat. To address this gap, we propose a conceptual taxonomy of multi-speaker conversational audio deepfakes, distinguishing between partial manipulations (one or more speakers altered) and full manipulations (entire conversations synthesized). As a first step, we introduce a new Multi-speaker Conversational Audio Deepfakes Dataset (MsCADD) of 2,830 audio clips containing real and fully synthetic two-speaker conversations, generated using VITS and SoundStorm-based NotebookLM models to simulate natural dialogue with variations in speaker gender and conversational spontaneity. MsCADD is limited to text-to-speech (TTS) deepfakes. We benchmark three neural baseline models (LFCC-LCNN, RawNet2, and Wav2Vec 2.0) on this dataset and report performance in terms of F1 score, accuracy, true positive rate (TPR), and true negative rate (TNR). Results show that these baselines provide a useful benchmark; however, they also reveal a significant gap in reliably detecting synthetic voices under varied conversational dynamics. Our dataset and benchmarks provide a foundation for future research on deepfake detection in conversational scenarios, an underexplored area of research and a major threat to trustworthy information in audio settings. The MsCADD dataset is publicly available to support reproducibility and benchmarking by the research community.
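For readers unfamiliar with the four reported metrics, the following is a minimal sketch (not the paper's evaluation code) of how F1 score, accuracy, TPR, and TNR are computed for a binary deepfake detector. The labeling convention (1 = fake/synthetic, 0 = real) and the toy predictions are illustrative assumptions, not results from the paper.

```python
# Illustrative sketch: the four metrics reported in the paper for a
# binary deepfake detector. Convention assumed here: 1 = fake, 0 = real.

def detection_metrics(y_true, y_pred):
    # Tally the confusion-matrix cells.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0   # true positive rate (fakes caught)
    tnr = tn / (tn + fp) if (tn + fp) else 0.0   # true negative rate (real audio kept)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # F1 is the harmonic mean of precision and TPR (recall).
    f1 = 2 * precision * tpr / (precision + tpr) if (precision + tpr) else 0.0
    return {"f1": f1, "accuracy": accuracy, "tpr": tpr, "tnr": tnr}

# Toy example (hypothetical labels, not MsCADD results):
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
print(detection_metrics(y_true, y_pred))
# → {'f1': 0.75, 'accuracy': 0.75, 'tpr': 0.75, 'tnr': 0.75}
```

Reporting TPR and TNR separately alongside accuracy matters here because a detector can score high accuracy on an imbalanced split while still missing most synthetic dialogues.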