Deepfake Detection with Spatio-Temporal Consistency and Attention

📅 2022-11-30
🏛️ International Conference on Digital Image Computing: Techniques and Applications
📈 Citations: 4
Influential: 0
🤖 AI Summary
Existing Deepfake detection methods often overlook local spatiotemporal inconsistencies and subtle forgery patterns. To address this, we propose an end-to-end neural network that jointly models frame-level spatial attention and sequence-level distance attention for fine-grained forgery localization and classification. Our approach innovatively fuses texture-enhanced shallow and deep features and introduces a distance attention mechanism to explicitly capture cross-frame temporal dependencies, thereby modeling spatiotemporal-coupled forgery signatures. Built upon a ResNet backbone, the architecture integrates spatial attention, distance attention, multi-level feature fusion, and texture enhancement modules. Extensive experiments demonstrate that our method achieves state-of-the-art performance on FaceForensics++ and Celeb-DF benchmarks, outperforming prior approaches in detection accuracy while maintaining superior efficiency in memory footprint and computational cost.

📝 Abstract
Deepfake videos are causing growing concern due to their ever-increasing realism. Naturally, automated detection of forged Deepfake videos is attracting proportional interest from researchers. Current methods for detecting forged videos mainly rely on global frame features and under-utilize the spatio-temporal inconsistencies found in manipulated videos. Moreover, they fail to attend to manipulation-specific subtle and well-localized pattern variations along both the spatial and temporal dimensions. Addressing these gaps, we propose a neural Deepfake detector that focuses on the localized manipulative signatures of forged videos at the individual frame level as well as the frame sequence level. Using a ResNet backbone, it strengthens shallow frame-level feature learning with a spatial attention mechanism. The spatial stream of the model further benefits from fusing texture-enhanced shallow features with the deeper features. Simultaneously, the model processes frame sequences with a distance attention mechanism that additionally allows fusion of temporal attention maps with the learned features at the deeper layers. The overall model is trained as a classifier to detect forged content. We test our method on two popular large-scale datasets, consistently outperforming related recent methods. Moreover, our technique also provides memory and computational advantages over competing techniques.
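The paper does not publish code, but the spatial stream it describes — channel-pooled spatial attention on shallow features plus fusion of texture-enhanced shallow features with deeper features — can be sketched. The sketch below is a minimal NumPy illustration under stated assumptions: a CBAM-style sigmoid gate over channel-pooled maps stands in for the spatial attention module, and simple mean-subtraction stands in for texture enhancement; the paper's exact modules may differ.

```python
import numpy as np

def spatial_attention(feat):
    """Illustrative CBAM-style spatial attention (assumption, not the
    paper's exact module): pool across channels, gate with a sigmoid.

    feat: (C, H, W) feature map. Returns an attended map of the same shape.
    """
    avg = feat.mean(axis=0, keepdims=True)      # (1, H, W) average-pooled
    mx = feat.max(axis=0, keepdims=True)        # (1, H, W) max-pooled
    gate = 1.0 / (1.0 + np.exp(-(avg + mx)))    # sigmoid gate per location
    return feat * gate                          # broadcast over channels

def fuse_texture(shallow, deep):
    """Fuse texture-enhanced shallow features with deep features.

    Mean-subtraction acts as a crude high-pass 'texture enhancer'
    (assumption); the shallow map is then average-pooled down to the
    deep map's spatial resolution and added element-wise.
    """
    texture = shallow - shallow.mean(axis=(1, 2), keepdims=True)
    C, H, W = texture.shape
    _, h, w = deep.shape
    fh, fw = H // h, W // w                     # pooling factors
    pooled = texture.reshape(C, h, fh, w, fw).mean(axis=(2, 4))
    return deep + pooled

# Toy shapes: 8-channel 16x16 shallow map, 8-channel 4x4 deep map.
feat = np.random.rand(8, 16, 16)
att = spatial_attention(feat)
deep = np.random.rand(8, 4, 4)
fused = fuse_texture(feat, deep)
print(att.shape, fused.shape)                   # (8, 16, 16) (8, 4, 4)
```

In a real detector these operations would sit inside learned convolutional layers; the sketch only shows the data flow of attention gating and shallow-to-deep fusion.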
Problem

Research questions and friction points this paper is trying to address.

Detects Deepfake videos using spatio-temporal inconsistencies
Enhances detection with spatial and temporal attention mechanisms
Improves performance and efficiency over existing methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatial attention mechanism
Temporal attention mechanism
ResNet backbone
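The temporal innovation, "distance attention" across a frame sequence, can likewise be sketched. The form below is a plain-NumPy assumption about how such a mechanism might look: pairwise L2 distances between per-frame embeddings are turned into a row-normalised attention map (closer frames attend to each other more), which then re-weights the embeddings; the paper's actual formulation is not published.

```python
import numpy as np

def distance_attention(frames):
    """Hypothetical sequence-level distance attention sketch.

    frames: (T, D) per-frame embeddings. Builds a T x T attention map
    from negated pairwise L2 distances (row-softmax), then returns the
    re-weighted embeddings and the map itself.
    """
    diff = frames[:, None, :] - frames[None, :, :]   # (T, T, D) pairwise diffs
    dist = np.sqrt((diff ** 2).sum(axis=-1))         # (T, T) L2 distances
    logits = -dist                                   # closer -> higher score
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)          # row-softmax
    return attn @ frames, attn

# Toy sequence: 6 frames with 32-dim embeddings.
emb = np.random.rand(6, 32)
out, attn = distance_attention(emb)
print(out.shape, attn.shape)                         # (6, 32) (6, 6)
```

Frames whose embeddings deviate from their temporal neighbours (a hallmark of localized forgery inconsistency) receive distinctive rows in the map, which is what makes such a map useful for fusing back into deeper layers as the abstract describes.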
Yunzhuo Chen
The University of Western Australia, Perth, Australia
Naveed Akhtar
The University of Western Australia, Perth, Australia
Nur Al Hasan Haldar
Curtin University
Data Science, Cyber Security, Graph Analytics, Social Network Analysis, Database
Ajmal Mian
The University of Western Australia, Perth, Australia