DeepFake Doctor: Diagnosing and Treating Audio-Video Fake Detection

📅 2025-06-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Generative AI has enabled highly realistic audio-visual DeepFakes, posing severe security and ethical risks. Existing audio-video DeepFake detection methods suffer from dataset biases (e.g., the "silence shortcut" in FakeAVCeleb), reliance on spurious correlations, and inconsistent evaluation protocols, leading to unreliable benchmarks and poor reproducibility. This work systematically diagnoses these flaws across the three benchmarking pillars (datasets, detection methods, and evaluation protocols) and contributes: (1) the first standardized evaluation protocol for the recent DeepSpeak v1 dataset; (2) SIMBA, a lightweight and efficient multimodal baseline; (3) a mechanism for identifying and suppressing the audio shortcut; and (4) a revised evaluation scheme for FakeAVCeleb. Experiments show significantly more reliable assessment on FakeAVCeleb, with SIMBA matching state-of-the-art performance across multiple benchmarks, and together establish an integrated diagnosis-and-mitigation benchmarking framework for audio-visual DeepFake detection.
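The "silence shortcut" mentioned above refers to a bias in FakeAVCeleb where manipulated clips tend to contain (near-)silent audio, so a model can score well by detecting silence rather than manipulation artifacts. A minimal sketch of how trivially exploitable such a bias is (the energy threshold and synthetic clips below are illustrative assumptions, not values from the paper):

```python
import numpy as np

def rms_energy(waveform: np.ndarray) -> float:
    """Root-mean-square energy of a mono waveform."""
    return float(np.sqrt(np.mean(np.square(waveform))))

def silence_shortcut_detector(waveform: np.ndarray, threshold: float = 1e-3) -> str:
    """Trivial 'detector' that exploits the dataset bias: label a clip
    fake whenever its audio track is (near-)silent. It inspects no
    visual content and learns nothing about manipulation artifacts."""
    return "fake" if rms_energy(waveform) < threshold else "real"

# Synthetic stand-ins: a silent clip vs. a clip with speech-like energy.
silent_clip = np.zeros(16000)  # 1 s of silence at 16 kHz
voiced_clip = 0.1 * np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)

print(silence_shortcut_detector(silent_clip))  # → fake
print(silence_shortcut_detector(voiced_clip))  # → real
```

A classifier that converges to this rule looks strong on the biased benchmark yet fails on any dataset where fakes carry normal audio, which is why the paper argues for shortcut identification and suppression.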

📝 Abstract
Generative AI advances rapidly, allowing the creation of very realistic manipulated video and audio. This progress presents a significant security and ethical threat, as malicious users can exploit DeepFake techniques to spread misinformation. Recent DeepFake detection approaches explore the multimodal (audio-video) threat scenario. However, there is a lack of reproducibility and critical issues with existing datasets, such as the recently uncovered silence shortcut in the widely used FakeAVCeleb dataset. Considering the importance of this topic, we aim to gain a deeper understanding of the key issues affecting benchmarking in audio-video DeepFake detection. We examine these challenges through the lens of the three core benchmarking pillars: datasets, detection methods, and evaluation protocols. To address these issues, we spotlight the recent DeepSpeak v1 dataset and are the first to propose an evaluation protocol and benchmark it using SOTA models. We introduce SImple Multimodal BAseline (SIMBA), a competitive yet minimalistic approach that enables the exploration of diverse design choices. We also deepen insights into the issue of audio shortcuts and present a promising mitigation strategy. Finally, we analyze and enhance the evaluation scheme on the widely used FakeAVCeleb dataset. Our findings offer a way forward in the complex area of audio-video DeepFake detection.
Problem

Research questions and friction points this paper is trying to address.

Addressing reproducibility and dataset issues in audio-video DeepFake detection
Developing a benchmark protocol for evaluating multimodal DeepFake detection methods
Mitigating audio shortcuts and enhancing evaluation in existing DeepFake datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces SIMBA for multimodal DeepFake detection
Proposes evaluation protocol for DeepSpeak v1 dataset
Mitigates audio shortcuts in FakeAVCeleb dataset
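The source does not specify SIMBA's architecture, but the general idea of a minimalistic multimodal baseline can be sketched as simple late fusion: embed each modality separately, concatenate, and score with a single linear head. All dimensions and the random head below are illustrative assumptions, not SIMBA's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-clip embeddings from frozen unimodal encoders
# (dimensions are illustrative, not taken from the paper).
audio_feat = rng.standard_normal(128)
video_feat = rng.standard_normal(256)

# Late fusion: concatenate the unimodal features into one vector.
fused = np.concatenate([audio_feat, video_feat])  # shape (384,)

# A single linear head over the fused vector scores {real, fake}.
W = rng.standard_normal((2, fused.shape[0])) * 0.01  # randomly initialized
logits = W @ fused
probs = np.exp(logits) / np.exp(logits).sum()  # softmax over the two classes

print(fused.shape, probs.shape)
```

Keeping the fusion this simple is what makes such a baseline useful for ablations: each design choice (encoder, fusion point, head) can be swapped independently and its effect measured in isolation.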
Marcel Klemt
TU Darmstadt & Hessian.AI
Carlotta Segna
TU Darmstadt & Hessian.AI
Anna Rohrbach
Professor, TU Darmstadt, Germany
Vision and Language · Artificial Intelligence · Multimodal Grounded Learning