🤖 AI Summary
Current audio deepfake detection (ADD) evaluation suffers from two critical limitations: (1) severe class imbalance across multi-synthesizer datasets, biasing Equal Error Rate (EER) estimates toward dominant synthesizers and undermining fairness; and (2) overly narrow coverage of real speech—typically limited to clean, read utterances—failing to reflect real-world acoustic and linguistic diversity. To address these issues, we propose the “bona fide cross-testing” evaluation framework. It integrates 150+ text-to-speech and voice-cloning synthesizers with nine diverse bona fide speech categories spanning environmental conditions, languages, speaking styles, and recording qualities. We design a cross-dataset evaluation protocol and introduce a balanced EER aggregation strategy to ensure fair, representative performance estimation. Our framework significantly enhances evaluation robustness and interpretability, systematically exposing model weaknesses under realistic conditions. Furthermore, we publicly release a large-scale, standardized benchmark to advance reproducible, practical ADD evaluation.
📝 Abstract
Audio deepfake detection (ADD) models are commonly evaluated using datasets that combine multiple synthesizers, with performance reported as a single Equal Error Rate (EER). However, this approach disproportionately weights synthesizers with more samples, underrepresenting others and reducing the overall reliability of EER. Additionally, most ADD datasets lack diversity in bona fide speech, often featuring a single environment and speech style (e.g., clean read speech), limiting their ability to simulate real-world conditions. To address these challenges, we propose bona fide cross-testing, a novel evaluation framework that incorporates diverse bona fide datasets and aggregates EERs for more balanced assessments. Our approach improves robustness and interpretability compared to traditional evaluation methods. We benchmark over 150 synthesizers across nine bona fide speech types and release a new dataset to facilitate further research at https://github.com/cyaaronk/audio_deepfake_eval.
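The core idea of the abstract — computing an EER per (bona fide type, synthesizer) pair and then aggregating, rather than pooling all samples into one EER — can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the paper's exact protocol: the function names (`eer`, `balanced_eer`), the convention that higher scores indicate bona fide speech, and the simple unweighted mean as the aggregation rule are all illustrative choices.

```python
import numpy as np

def eer(bona_scores, spoof_scores):
    # Equal Error Rate: the operating point where the false-accept
    # rate (spoofs scored as bona fide) equals the false-reject rate
    # (bona fide scored as spoof). Higher score = more likely bona fide.
    thresholds = np.sort(np.concatenate([bona_scores, spoof_scores]))
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    frr = np.array([(bona_scores < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2

def balanced_eer(bona_by_type, spoof_by_synth):
    # Cross-test every bona fide category against every synthesizer,
    # then average the per-pair EERs so each pair counts equally,
    # regardless of how many samples each synthesizer contributed.
    eers = [eer(b, s)
            for b in bona_by_type.values()
            for s in spoof_by_synth.values()]
    return float(np.mean(eers))
```

A pooled EER over the concatenated scores would instead be dominated by whichever synthesizer has the most samples; the pairwise aggregation above removes that weighting.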