Do You See What I Say? Generalizable Deepfake Detection based on Visual Speech Recognition

📅 2025-11-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the limited zero-shot generalization capability of deepfake video detection, this paper proposes FauxNet—the first framework to leverage pre-trained visual speech recognition (VSR) features for deepfake detection. Methodologically, it extracts lip-motion–speech temporal consistency features using a VSR model and feeds them into a lightweight temporal classifier, enabling simultaneous forgery classification and generative technique attribution. Its key innovation lies in exploiting the inherent sensitivity of VSR features to synthetic artifacts, thereby endowing the model with zero-shot robustness across diverse deepfake generation algorithms. Extensive experiments demonstrate that FauxNet significantly outperforms state-of-the-art methods on Authentica-Vox, Authentica-HDTF, and FaceForensics++. Furthermore, the authors introduce and publicly release Authentica-DF, a large-scale benchmark dataset comprising 38,000 videos specifically designed for zero-shot deepfake detection evaluation.

Technology Category

Application Category

📝 Abstract
Deepfake generation has witnessed remarkable progress, contributing to highly realistic generated images, videos, and audio. While technically intriguing, such progress has raised serious concerns related to the misuse of manipulated media. To mitigate such misuse, robust and reliable deepfake detection is urgently needed. Towards this, we propose a novel network FauxNet, which is based on pre-trained Visual Speech Recognition (VSR) features. By extracting temporal VSR features from videos, we identify and segregate real videos from manipulated ones. The holy grail in this context has to do with zero-shot detection, i.e., generalizable detection, which we focus on in this work. FauxNet consistently outperforms the state-of-the-art in this setting. In addition, FauxNet is able to attribute - distinguish between generation techniques from which the videos stem. Finally, we propose new datasets, referred to as Authentica-Vox and Authentica-HDTF, comprising about 38,000 real and fake videos in total, the latter created with six recent deepfake generation techniques. We provide extensive analysis and results on the Authentica datasets and FaceForensics++, demonstrating the superiority of FauxNet. The Authentica datasets will be made publicly available.
Problem

Research questions and friction points this paper is trying to address.

Detects deepfake videos using visual speech recognition features
Achieves zero-shot generalizable detection across unseen manipulation techniques
Distinguishes between different deepfake generation methods for attribution
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Visual Speech Recognition features for detection
Achieves zero-shot generalizable deepfake detection
Introduces new datasets with multiple generation techniques
🔎 Similar Papers
No similar papers found.
M
Maheswar Bora
Machine Intelligence Group, Department of CS&IS, Birla Institute of Technology and Sciences, Pilani, India
T
Tashvik Dhamija
STARS team Inria Center at Université Côte d’Azur in Sophia Antipolis, France
S
Shukesh Reddy
Machine Intelligence Group, Department of CS&IS, Birla Institute of Technology and Sciences, Pilani, India
B
Baptiste Chopin
STARS team Inria Center at Université Côte d’Azur in Sophia Antipolis, France
P
Pranav Balaji
STARS team Inria Center at Université Côte d’Azur in Sophia Antipolis, France
A
Abhijit Das
Machine Intelligence Group, Department of CS&IS, Birla Institute of Technology and Sciences, Pilani, India
Antitza Dantcheva
Antitza Dantcheva
Directrice de Recherche, Inria, France
Video generationDeepfake generation and detectionFace analysis for health monitoring and