Where are we in audio deepfake detection? A systematic analysis over generative and detection models

📅 2024-10-06
📈 Citations: 2
Influential: 0
🤖 AI Summary
Problem: AI-generated speech from text-to-speech (TTS) and voice-conversion (VC) systems has become realistic faster than detectors have improved, and existing detection methods often fail to generalize across datasets. Method: The paper introduces SONAR, a benchmark framework with an evaluation dataset drawn from nine audio synthesis platforms. It is presented as the first framework to uniformly benchmark both traditional detectors and Speech Foundation Models (SFMs). Contributions/Results: Foundation models generalize better than existing detectors, which the authors tentatively attribute to model size and to the scale and quality of pretraining data. SFMs fine-tuned solely on English speech maintain strong performance across diverse languages. Few-shot fine-tuning is explored as an effective and efficient way to improve generalization, with potential for tailored uses such as personalized detection for specific entities or individuals.

📝 Abstract
Recent advances in Text-to-Speech (TTS) and Voice-Conversion (VC) using generative Artificial Intelligence (AI) technology have made it possible to generate high-quality and realistic human-like audio. This poses growing challenges in distinguishing AI-synthesized speech from the genuine human voice and raises concerns about misuse for impersonation, fraud, spreading misinformation, and scams. However, existing detection methods for AI-synthesized audio have not kept pace and often fail to generalize across diverse datasets. In this paper, we introduce SONAR, a synthetic AI-Audio Detection Framework and Benchmark, aiming to provide a comprehensive evaluation for distinguishing cutting-edge AI-synthesized auditory content. SONAR includes a novel evaluation dataset sourced from 9 diverse audio synthesis platforms, including leading TTS providers and state-of-the-art TTS models. It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation model-based detection systems. Through extensive experiments, we report three findings. (1) We reveal the limitations of existing detection methods and demonstrate that foundation models exhibit stronger generalization capabilities, likely due to their model size and the scale and quality of pretraining data. (2) Speech foundation models demonstrate robust cross-lingual generalization capabilities, maintaining strong performance across diverse languages despite being fine-tuned solely on English speech data. This finding also suggests that the primary challenges in audio deepfake detection are more closely tied to the realism and quality of synthetic audio than to language-specific characteristics. (3) We explore the effectiveness and efficiency of few-shot fine-tuning in improving generalization, highlighting its potential for tailored applications, such as personalized detection systems for specific entities or individuals.
Problem

Research questions and friction points this paper is trying to address.

Detecting AI-synthesized audio amid rising misuse concerns
Evaluating generalization of detection methods across diverse datasets
Assessing cross-lingual robustness in audio deepfake detection
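
To make the generalization question above concrete, the following is a minimal sketch of how detector outputs might be scored separately for each synthesis source. It is not the SONAR evaluation code. The record layout, the function names, the 0.5 threshold, and the choice of per-source accuracy and equal error rate (EER, a metric commonly used for spoofing detection) are all assumptions made for illustration; the text here does not state which metrics the paper reports.

```python
from collections import defaultdict


def per_source_accuracy(records, threshold=0.5):
    """Accuracy per synthesis source at a fixed decision threshold.

    records: iterable of (source, label, score) tuples (hypothetical layout),
    where label is 1 for synthetic audio and 0 for genuine audio, and score
    is the detector's estimated probability that the clip is synthetic.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for source, label, score in records:
        predicted = 1 if score >= threshold else 0
        correct[source] += int(predicted == label)
        total[source] += 1
    return {source: correct[source] / total[source] for source in total}


def equal_error_rate(labels, scores):
    """Approximate EER by sweeping thresholds over the observed scores.

    Returns the mean of the false-positive and false-negative rates at the
    threshold where the two rates are closest to each other.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one genuine and one synthetic sample")
    best_gap, eer = float("inf"), None
    for t in sorted(set(scores)):
        false_pos = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= t)
        false_neg = sum(1 for y, s in zip(labels, scores) if y == 1 and s < t)
        fpr, fnr = false_pos / negatives, false_neg / positives
        if abs(fpr - fnr) < best_gap:
            best_gap, eer = abs(fpr - fnr), (fpr + fnr) / 2
    return eer


if __name__ == "__main__":
    # Toy scores for two made-up sources; not data from the paper.
    demo = [
        ("platform_a", 1, 0.9), ("platform_a", 0, 0.2),
        ("platform_b", 1, 0.4), ("platform_b", 0, 0.1),
    ]
    print(per_source_accuracy(demo))
    print(equal_error_rate([y for _, y, _ in demo], [s for _, _, s in demo]))
```

Reporting results per source rather than pooled is what makes a generalization gap visible: a detector can look strong on average while failing on one platform's output.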
Innovation

Methods, ideas, or system contributions that make the work stand out.

SONAR framework benchmarks AI-audio detection
Foundation models enhance cross-lingual generalization
Few-shot fine-tuning efficiently improves detector generalization