🤖 AI Summary
This work addresses the limited generalization and poor robustness of existing multimodal deepfake detection methods, which rely heavily on synthetic training data and struggle with unseen forgery techniques. To overcome these limitations, we propose a self-supervised audio-visual deepfake detection framework trained exclusively on real videos. Our approach generates identity-preserving, region-aware pseudo-deepfake samples on the fly to learn multi-granularity visual artifacts, while modeling the temporal alignment between lip movements and speech to capture cross-modal inconsistencies. Notably, the method requires no forged data during training, yet achieves strong in-domain detection performance and remarkable cross-dataset generalization, as demonstrated by state-of-the-art results on both the FakeAVCeleb and AV-LipSync-TIMIT benchmarks.
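To make the pseudo-deepfake idea concrete, here is a minimal, illustrative self-blending sketch in NumPy: a copy of a real face crop is mildly perturbed and blended back into the original inside one facial region, so identity is preserved while subtle blending artifacts appear. The perturbation, region handling, feathered mask, and function names are assumptions for illustration only, not SAVe's actual augmentation pipeline.

```python
import numpy as np

def make_pseudo_fake(face, region_box, color_jitter=0.05, feather=15, rng=None):
    """Illustrative self-blending (hypothetical helper, not the paper's code):
    perturb a copy of an authentic face crop and blend it back inside one
    facial region, leaving identity intact but introducing subtle
    blending/appearance artifacts that mimic tampering traces.

    face       : float32 array (H, W, 3) in [0, 1], an authentic face crop
    region_box : (y0, y1, x0, x1) region to manipulate (e.g., mouth or eyes)
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w, _ = face.shape

    # "Source" view: the same face with a mild per-channel appearance
    # perturbation (stand-in for an unspecified augmentation pipeline).
    jitter = 1.0 + rng.uniform(-color_jitter, color_jitter, size=(1, 1, 3))
    source = np.clip(face * jitter, 0.0, 1.0)

    # Soft region mask with feathered borders so the blend boundary is subtle.
    y0, y1, x0, x1 = region_box
    mask = np.zeros((h, w, 1), dtype=np.float32)
    mask[y0:y1, x0:x1] = 1.0
    if feather > 0:
        kernel = np.ones(feather, dtype=np.float32) / feather
        for axis in (0, 1):
            mask = np.apply_along_axis(
                lambda m: np.convolve(m, kernel, mode="same"), axis, mask)

    # Blend the perturbed source into the original within the region.
    pseudo_fake = mask * source + (1.0 - mask) * face
    return pseudo_fake.astype(np.float32), mask
```

In a training loop, such samples would be labeled as manipulated and paired with the untouched crop labeled as real, so the binary supervision comes entirely from authentic footage.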
📝 Abstract
Multimodal deepfakes can exhibit subtle visual artifacts and cross-modal inconsistencies that remain challenging to detect, especially when detectors are trained primarily on curated synthetic forgeries. This dependence on synthetic data can introduce dataset- and generator-specific biases, limiting scalability and robustness to unseen manipulations. We propose SAVe, a self-supervised audio-visual deepfake detection framework that learns entirely from authentic videos. SAVe generates on-the-fly, identity-preserving, region-aware self-blended pseudo-manipulations to emulate tampering artifacts, enabling the model to learn complementary visual cues across multiple facial granularities. To capture cross-modal evidence, SAVe also models lip-speech synchronization via an audio-visual alignment component that detects the temporal misalignment patterns characteristic of audio-visual forgeries. Experiments on FakeAVCeleb and AV-LipSync-TIMIT demonstrate competitive in-domain performance and strong cross-dataset generalization, highlighting self-supervised learning as a scalable paradigm for multimodal deepfake detection.
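As a rough intuition for the lip-speech alignment cue, the sketch below scores cosine similarity between per-frame lip and audio embeddings at the true offset versus artificially shifted offsets; a genuine video is expected to peak near zero shift, while dubbed or re-synthesized audio tends not to. This is a hand-rolled heuristic assuming precomputed embeddings from some visual and audio encoders, not the learned alignment component described in the paper.

```python
import numpy as np

def sync_scores(visual_feats, audio_feats, max_shift=5):
    """Heuristic lip-speech alignment check (illustrative, not SAVe's module).

    visual_feats, audio_feats : (T, D) arrays of per-frame embeddings,
    assumed to come from pretrained lip and speech encoders.
    Returns a dict mapping temporal shift -> mean cosine similarity.
    """
    def cosine(a, b):
        a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + 1e-8)
        b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + 1e-8)
        return float(np.mean(np.sum(a * b, axis=-1)))

    T = min(len(visual_feats), len(audio_feats))
    scores = {}
    for shift in range(-max_shift, max_shift + 1):
        # Pair visual frame t with audio frame t + shift, keeping valid indices.
        v_lo, v_hi = max(0, -shift), min(T, T - shift)
        a_lo, a_hi = v_lo + shift, v_hi + shift
        scores[shift] = cosine(visual_feats[v_lo:v_hi], audio_feats[a_lo:a_hi])
    # A real, in-sync video should score highest at (or near) shift == 0;
    # a flat or off-center profile hints at cross-modal inconsistency.
    return scores
```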