Generalizing Video DeepFake Detection by Self-generated Audio-Visual Pseudo-Fakes

📅 2026-04-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limited generalization of existing video deepfake detection methods, which stems from insufficient diversity in training data. The authors propose a training paradigm that requires no real forged samples. Instead, it uses only authentic videos to automatically generate pseudo-forged videos, termed Audio-Visual Pseudo-Fakes (AVPF), that exhibit diverse audio-visual correspondence patterns. Training on these samples improves the model's ability to recognize unseen forgery types. Evaluated on multiple standard datasets, the method achieves an average performance improvement of up to 7.4%, indicating stronger generalization to unseen cases.

📝 Abstract

Detecting video deepfakes has become increasingly urgent in recent years. Given the audio-visual information in videos, existing methods typically expose deepfakes by modeling cross-modal correspondence using specifically designed architectures with publicly available datasets. While they have shown promising results, their effectiveness often degrades in real-world scenarios, as the limited diversity of training datasets naturally restricts generalizability to unseen cases. To address this, we propose a simple yet effective method, called AVPF, which can notably enhance model generalizability by training with self-generated Audio-Visual Pseudo-Fakes. The key idea of AVPF is to create pseudo-fake training samples that contain diverse audio-visual correspondence patterns commonly observed in real-world deepfakes. We highlight that AVPF is generated solely from authentic samples, and training relies only on authentic data and AVPF, without requiring any real deepfakes. Extensive experiments on multiple standard datasets demonstrate the strong generalizability of the proposed method, achieving an average performance improvement of up to 7.4%.
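
The abstract does not say how the pseudo-fakes are constructed, so the following is only an illustrative sketch of the general idea, not the authors' procedure. It assumes a clip is stored as two aligned lists (one video frame and one audio chunk per time step) and breaks the audio-visual correspondence of an authentic clip in two hypothetical ways: rotating the audio inside a random segment, or replacing that segment with audio from another authentic clip. The function names, the two perturbations, and the labeling convention (1 = pseudo-fake) are all assumptions made for illustration.

```python
import random

def shift_audio(audio, start, end, offset):
    """Return a copy of `audio` with the chunks in [start, end) rotated by `offset`."""
    segment = audio[start:end]
    k = offset % len(segment)
    # Guard k == 0: segment[-0:] would return the whole segment.
    rotated = segment[-k:] + segment[:-k] if k else segment
    return audio[:start] + rotated + audio[end:]

def swap_audio(audio, donor, start, end):
    """Return a copy of `audio` with [start, end) taken from `donor` instead."""
    return audio[:start] + donor[start:end] + audio[end:]

def make_pseudo_fake(video, audio, donor_audio, rng):
    """Build one hypothetical pseudo-fake from authentic data only.

    Assumes len(audio) >= 2 and len(donor_audio) >= len(audio).
    Returns (video, perturbed_audio, label) with label 1 = pseudo-fake.
    """
    n = len(audio)
    length = rng.randint(2, n)          # perturbed segment length (inclusive bounds)
    start = rng.randint(0, n - length)  # segment start so it fits in the clip
    end = start + length
    if rng.random() < 0.5:
        # Non-zero offset, so audio and video are misaligned inside the segment.
        offset = rng.randint(1, length - 1)
        new_audio = shift_audio(audio, start, end, offset)
    else:
        new_audio = swap_audio(audio, donor_audio, start, end)
    return video, new_audio, 1

if __name__ == "__main__":
    rng = random.Random(0)
    video = list(range(8))                  # stand-ins for video frames
    audio = ["a%d" % i for i in range(8)]   # stand-ins for aligned audio chunks
    donor = ["d%d" % i for i in range(8)]   # audio from another authentic clip
    print(make_pseudo_fake(video, audio, donor, rng))
```

Under these assumptions, a training batch would mix untouched authentic clips (label 0) with such perturbed clips (label 1), so that no real deepfake is needed at training time. Varying the segment length, position, and perturbation type is one simple way to obtain a range of correspondence patterns.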

Problem

Research questions and friction points this paper is trying to address.

Video DeepFake Detection

Generalizability

Audio-Visual Correspondence

Training Data Diversity

Innovation

Methods, ideas, or system contributions that make the work stand out.

Audio-Visual Pseudo-Fakes

DeepFake Detection

Self-generated Training Data

Cross-modal Correspondence