SAVe: Self-Supervised Audio-visual Deepfake Detection Exploiting Visual Artifacts and Audio-visual Misalignment

📅 2026-03-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited generalization and poor robustness of existing multimodal deepfake detection methods, which heavily rely on synthetic training data and struggle with unseen forgery techniques. To overcome these limitations, we propose a self-supervised audio-visual deepfake detection framework trained exclusively on real videos. Our approach generates identity-preserving, region-aware pseudo-deepfake samples online to learn multi-granularity visual artifacts, while simultaneously modeling the temporal alignment between lip movements and speech to capture cross-modal inconsistencies. Notably, the method requires no forged data during training, yet achieves strong in-domain detection performance and remarkable cross-dataset generalization, as demonstrated by state-of-the-art results on both FakeAVCeleb and AV-LipSync-TIMIT benchmarks.
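The summary's core idea of generating identity-preserving, region-aware pseudo-deepfakes online can be sketched as a self-blending augmentation: a lightly transformed copy of a real face crop is blended back into itself through a soft region mask, so the subject's identity is unchanged but a subtle blending seam remains for the detector to learn. The transform, mask geometry, and feathering below are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def self_blend(face, rng=None, region=None):
    """Create a pseudo-deepfake from one real face crop by blending a
    lightly transformed copy of the face back into itself through a soft
    region mask (a generic self-blending sketch; the paper's actual
    transforms and mask schedule are not specified here).

    face: float array (H, W, 3) in [0, 1].
    region: (top, bottom, left, right) of the blended region; defaults to
            the lower half of the crop as a mouth-area proxy.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w, _ = face.shape

    # Source view: mild color jitter stands in for the appearance
    # transform that creates subtle artifacts at the mask boundary.
    source = np.clip(face * rng.uniform(0.9, 1.1)
                     + rng.uniform(-0.05, 0.05), 0.0, 1.0)

    # Hard rectangular mask over the chosen facial region ("region-aware").
    top, bottom, left, right = region if region else (h // 2, h, w // 4, 3 * w // 4)
    mask = np.zeros((h, w, 1), dtype=face.dtype)
    mask[top:bottom, left:right] = 1.0

    # Feather the mask edges with a small box blur so the seam is soft.
    k = 5
    pad = np.pad(mask, ((k, k), (k, k), (0, 0)), mode="edge")
    blurred = np.zeros_like(mask)
    for dy in range(-k, k + 1):
        for dx in range(-k, k + 1):
            blurred += pad[k + dy:k + dy + h, k + dx:k + dx + w]
    mask = blurred / (2 * k + 1) ** 2

    # Blend: identity is preserved (both views are the same person),
    # but the seam is the local visual artifact the detector must find.
    return mask * source + (1.0 - mask) * face
```

Because the pseudo-fake is derived from a single real frame, no generator or curated forgery set is needed, which is what removes the dataset and generator bias the summary points to.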

📝 Abstract
Multimodal deepfakes can exhibit subtle visual artifacts and cross-modal inconsistencies, which remain challenging to detect, especially when detectors are trained primarily on curated synthetic forgeries. Such synthetic dependence can introduce dataset and generator bias, limiting scalability and robustness to unseen manipulations. We propose SAVe, a self-supervised audio-visual deepfake detection framework that learns entirely on authentic videos. SAVe generates on-the-fly, identity-preserving, region-aware self-blended pseudo-manipulations to emulate tampering artifacts, enabling the model to learn complementary visual cues across multiple facial granularities. To capture cross-modal evidence, SAVe also models lip-speech synchronization via an audio-visual alignment component that detects temporal misalignment patterns characteristic of audio-visual forgeries. Experiments on FakeAVCeleb and AV-LipSync-TIMIT demonstrate competitive in-domain performance and strong cross-dataset generalization, highlighting self-supervised learning as a scalable paradigm for multimodal deepfake detection.
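The abstract's audio-visual alignment component can likewise be trained from real videos alone: aligned lip/speech feature pairs serve as positives, and the same audio stream shifted in time against the video serves as negatives that mimic the temporal misalignment of forgeries. The sketch below is a generic illustration of this self-supervision scheme under assumed feature shapes; function and parameter names are hypothetical, not from the paper.

```python
import numpy as np

def make_sync_pair(video_feats, audio_feats, min_shift, rng, aligned):
    """Build a (video, audio, label) pair for lip-speech sync learning on
    real videos only. label 1 = genuine aligned pair; label 0 = the audio
    rolled in time against the video to simulate audio-visual misalignment.

    video_feats, audio_feats: (T, D) per-frame features at a shared rate.
    min_shift: smallest allowed temporal offset (frames) for negatives.
    """
    if aligned:
        return video_feats, audio_feats, 1
    t = audio_feats.shape[0]
    # Roll the audio by a nonzero offset so lips and speech no longer
    # line up; a sync head is then trained to separate the two labels.
    offset = int(rng.integers(min_shift, t - min_shift + 1))
    return video_feats, np.roll(audio_feats, offset, axis=0), 0
```

In this setup the binary labels come for free from the sampling procedure itself, which is what lets the cross-modal branch, like the visual branch, train without any forged data.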
Problem

Research questions and friction points this paper is trying to address.

multimodal deepfakes
visual artifacts
audio-visual misalignment
dataset bias
cross-modal inconsistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-supervised learning
audio-visual deepfake detection
visual artifacts
audio-visual misalignment
pseudo-manipulation
Sahibzada Adil Shahzad
Social Networks and Human-Centered Computing Program, Taiwan International Graduate Program, Institute of Information Science, Academia Sinica; and Department of Computer Science, National Chengchi University, Taipei 11529, Taiwan
Ammarah Hashmi
Social Networks and Human-Centered Computing Program, Taiwan International Graduate Program, Institute of Information Science, Academia Sinica, Taipei 11529, Taiwan; and Institute of Information Systems and Applications, National Tsing Hua University, Hsinchu 300044, Taiwan
Junichi Yamagishi
National Institute of Informatics, Tokyo, Japan
Speech processing, Speech synthesis, Biometrics, Deepfakes, Multimedia Forensics
Yusuke Yasuda
National Institute of Informatics, Tokyo 101-8430, Japan
Yu Tsao
Research Fellow (Professor), Deputy Director, CITI, Academia Sinica
Assistive Oral Communication Technologies, Speech Enhancement, Voice Conversion, Speech Assessment
Chia-Wen Lin
Department of Electrical Engineering and the Institute of Communications Engineering, National Tsing Hua University, Hsinchu 300044, Taiwan
Yan-Tsung Peng
National Chengchi University
Hsin-Min Wang
Research Fellow/Professor, Institute of Information Science, Academia Sinica
Spoken Language Processing, Natural Language Processing, Multimedia Information Retrieval, Machine Learning