TwinShift: Benchmarking Audio Deepfake Detection across Synthesizer and Speaker Shifts

📅 2025-10-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Audio deepfake detectors generalize poorly to unseen synthesis methods and speakers, which undermines their reliability in real-world deployment. To address this, we propose TWINSHIFT, a benchmark that explicitly decouples synthesis-model identity from speaker identity: it pairs six state-of-the-art generative systems with mutually exclusive speaker sets and evaluates detectors under a dual-shift zero-shot protocol (cross-synthesizer and cross-speaker). Systematic experiments reveal substantial performance degradation (an average drop of 32.7%) under strict zero-shot conditions, exposing critical robustness blind spots. TWINSHIFT thus provides a reproducible, standardized testbed that shifts evaluation away from the unrealistic i.i.d. assumption toward strong generalization, which is essential for practical deployment. By rigorously isolating synthesis-model and speaker variability, it establishes a foundational evaluation framework and concrete directions for developing the next generation of robust audio deepfake detection systems.

📝 Abstract
Audio deepfakes pose a growing threat, already exploited in fraud and misinformation. A key challenge is ensuring detectors remain robust to unseen synthesis methods and diverse speakers, since generation techniques evolve quickly. Despite strong benchmark results, current systems struggle to generalize to new conditions, limiting real-world reliability. To address this, we introduce TWINSHIFT, a benchmark explicitly designed to evaluate detection robustness under strictly unseen conditions. Our benchmark is constructed from six different synthesis systems, each paired with disjoint sets of speakers, allowing for a rigorous assessment of how well detectors generalize when both the generative model and the speaker identity change. Through extensive experiments, we show that TWINSHIFT reveals important robustness gaps, uncovers overlooked limitations, and provides principled guidance for developing audio deepfake detection (ADD) systems. The TWINSHIFT benchmark can be accessed at https://github.com/intheMeantime/TWINSHIFT.
Problem

Research questions and friction points this paper is trying to address.

Evaluating audio deepfake detection robustness under unseen conditions
Assessing detector generalization across different synthesis methods and speakers
Addressing reliability gaps in current audio deepfake detection systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark evaluates detection under unseen conditions
Uses six synthesis systems with disjoint speaker sets
Assesses generalization across model and speaker changes
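The core of the protocol described above is that a detector is tested on clips whose synthesizer and speakers were both absent from training. A minimal sketch of how such dual-shift splits could be constructed, assuming each fake clip is labelled with its (synthesizer, speaker) pair; the function and data names here are illustrative, not taken from the TWINSHIFT release:

```python
from itertools import product

def dual_shift_splits(samples):
    """samples: list of (synthesizer, speaker) pairs labelling fake clips.

    Yields (held_out_synth, train, test) where the test synthesizer AND all
    test speakers are unseen during training (strict zero-shot).
    """
    synths = sorted({s for s, _ in samples})
    for held_out_synth in synths:
        # Test set: every clip from the held-out synthesizer.
        test = [x for x in samples if x[0] == held_out_synth]
        test_speakers = {spk for _, spk in test}
        # Training pool: other synthesizers, restricted to speakers
        # disjoint from the held-out speaker set.
        train = [x for x in samples if x[0] != held_out_synth
                 and x[1] not in test_speakers]
        yield held_out_synth, train, test

# Toy example: 3 synthesizers, each paired with its own disjoint speakers.
samples = [(f"synth{i}", f"spk{i}{j}") for i, j in product(range(3), range(2))]
for synth, train, test in dual_shift_splits(samples):
    train_speakers = {spk for _, spk in train}
    test_speakers = {spk for _, spk in test}
    assert not train_speakers & test_speakers   # speaker sets are disjoint
    assert all(s != synth for s, _ in train)    # synthesizer is unseen
```

Because each synthesizer in the benchmark comes with its own speaker set, holding out one synthesizer automatically holds out its speakers too, which is what makes the zero-shot condition "strict".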
Jiyoung Hong
Ewha Womans University
Yoonseo Chung
Ewha Womans University
Seungyeon Oh
Ewha Womans University
Juntae Kim
SK Telecom, Seoul, Republic of Korea
Jiyoung Lee
Assistant Professor, Ewha Womans University
Multimodal Learning · Computer Vision · Machine Learning
Sookyung Kim
PARC (Palo Alto Research Center)
LLM Post-training · Reinforcement Learning · AI-driven Drug Discovery · Climate AI
Hyunsoo Cho
Ewha Womans University