🤖 AI Summary
This study addresses the limitations of existing deepfake detection research, which predominantly focuses on single-speaker scenarios and struggles to handle novel forgery threats in multi-speaker dialogues. To bridge this gap, the work proposes the first classification framework specifically designed for deepfakes in multi-speaker conversational settings and introduces MsCADD, the first public dataset dedicated to text-to-speech (TTS)-generated two-person dialogue deepfakes. The dataset is constructed using VITS and SoundStorm-based NotebookLM to synthesize realistic dialogue audio. Baseline detection evaluations are conducted with LFCC-LCNN, RawNet2, and Wav2Vec 2.0. Experimental results reveal that current models exhibit limited performance across key metrics including F1 score, accuracy, true positive rate (TPR), and true negative rate (TNR), underscoring the significant challenges posed by this task and establishing a foundation for future research.
📄 Abstract
Rapid advances in text-to-speech (TTS) technologies have made audio deepfakes increasingly realistic and accessible, raising significant security and trust concerns. While existing research has largely focused on detecting single-speaker audio deepfakes, real-world malicious applications in multi-speaker conversational settings are also emerging as a major, underexplored threat. To address this gap, we propose a conceptual taxonomy of multi-speaker conversational audio deepfakes, distinguishing between partial manipulations (one or more speakers altered) and full manipulations (entire conversations synthesized). As a first step, we introduce a new Multi-speaker Conversational Audio Deepfakes Dataset (MsCADD) of 2,830 audio clips containing real and fully synthetic two-speaker conversations, generated using VITS and SoundStorm-based NotebookLM models to simulate natural dialogue with variations in speaker gender and conversational spontaneity. MsCADD is limited to text-to-speech (TTS) deepfakes. We benchmark three neural baseline models (LFCC-LCNN, RawNet2, and Wav2Vec 2.0) on this dataset and report performance in terms of F1 score, accuracy, true positive rate (TPR), and true negative rate (TNR). Results show that these baselines provide a useful benchmark; however, they also reveal a significant gap in reliably detecting synthetic voices under varied conversational dynamics. Our dataset and benchmarks provide a foundation for future research on deepfake detection in conversational scenarios, an underexplored area of research and a major threat to trustworthy information in audio settings. The MsCADD dataset is publicly available to support reproducibility and benchmarking by the research community.
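For readers unfamiliar with the four reported metrics, the following is a minimal sketch (not the paper's evaluation code) of how F1 score, accuracy, TPR, and TNR are computed for a binary deepfake detector. The labeling convention (1 = fake/synthetic, 0 = real) and the toy predictions are illustrative assumptions, not results from the paper.

```python
# Illustrative sketch: the four metrics reported in the paper for a
# binary deepfake detector. Convention assumed here: 1 = fake, 0 = real.

def detection_metrics(y_true, y_pred):
    # Tally the confusion-matrix cells.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0   # true positive rate (fakes caught)
    tnr = tn / (tn + fp) if (tn + fp) else 0.0   # true negative rate (real audio kept)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # F1 is the harmonic mean of precision and TPR (recall).
    f1 = 2 * precision * tpr / (precision + tpr) if (precision + tpr) else 0.0
    return {"f1": f1, "accuracy": accuracy, "tpr": tpr, "tnr": tnr}

# Toy example (hypothetical labels, not MsCADD results):
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
print(detection_metrics(y_true, y_pred))
# → {'f1': 0.75, 'accuracy': 0.75, 'tpr': 0.75, 'tnr': 0.75}
```

Reporting TPR and TNR separately alongside accuracy matters here because a detector can score high accuracy on an imbalanced split while still missing most synthetic dialogues.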