🤖 AI Summary
Existing Deepfake detection methods often overlook local spatiotemporal inconsistencies and subtle forgery patterns. To address this, we propose an end-to-end neural network that jointly models frame-level spatial attention and sequence-level distance attention to detect localized forgery signatures and classify videos as real or forged. Our approach fuses texture-enhanced shallow features with deep features and introduces a distance attention mechanism to explicitly capture cross-frame temporal dependencies, thereby modeling spatiotemporally coupled forgery signatures. Built on a ResNet backbone, the architecture integrates spatial attention, distance attention, multi-level feature fusion, and texture enhancement modules. Experiments on the FaceForensics++ and Celeb-DF benchmarks show that our method outperforms recent related approaches in detection accuracy while requiring less memory and computation.
📝 Abstract
Deepfake videos are causing growing concern due to their ever-increasing realism. Naturally, automated detection of forged Deepfake videos is attracting a proportional amount of interest from researchers. Current methods for detecting forged videos mainly rely on global frame features and under-utilize the spatio-temporal inconsistencies found in manipulated videos. Moreover, they fail to attend to the subtle, well-localized, manipulation-specific pattern variations along both the spatial and temporal dimensions. To address these gaps, we propose a neural Deepfake detector that focuses on the localized manipulation signatures of forged videos at the individual frame level as well as the frame-sequence level. Using a ResNet backbone, it strengthens shallow frame-level feature learning with a spatial attention mechanism. The spatial stream of the model is further aided by fusing texture-enhanced shallow features with the deeper features. Simultaneously, the model processes frame sequences with a distance attention mechanism, which further allows temporal attention maps to be fused with the learned features at the deeper layers. The overall model is trained as a classifier to detect forged content. We test our method on two popular large datasets, consistently outperforming related recent methods. Moreover, our technique also provides memory and computational advantages over competing techniques.
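The abstract does not define the distance attention mechanism, so the following is only an illustrative sketch of one plausible reading, not the authors' implementation. It assumes each frame has already been reduced to a feature vector, scores every pair of frames by their negative Euclidean distance, and normalizes the scores with a softmax so that each frame is re-expressed as a weighted mix of all frames. The function name `distance_attention`, the `temperature` parameter, and the choice to favor nearby (similar) frames are all assumptions made for illustration.

```python
import math


def distance_attention(frames, temperature=1.0):
    """Hypothetical distance-based attention over per-frame feature vectors.

    frames: list of equal-length lists of floats, one vector per frame.
    Returns (attended, weights), where weights[i][j] is how much frame j
    contributes to the attended representation of frame i.
    """
    n = len(frames)
    dim = len(frames[0])
    attended, weights = [], []
    for i in range(n):
        # Assumption: frames closer in feature space receive higher scores.
        scores = [-math.dist(frames[i], frames[j]) / temperature for j in range(n)]
        # Numerically stable softmax over the scores for frame i.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps]
        weights.append(row)
        # Weighted sum of all frame vectors using the attention row.
        attended.append(
            [sum(row[j] * frames[j][k] for j in range(n)) for k in range(dim)]
        )
    return attended, weights


# Toy usage: two near-identical frames and one outlier frame.
frames = [[0.0, 0.0], [0.0, 0.0], [10.0, 10.0]]
attended, weights = distance_attention(frames)
```

In this toy input the third frame is far from the other two, so it contributes almost nothing to their attended representations. A detector could equally invert the scoring to emphasize inconsistent frames; which variant the paper uses is not stated in the abstract.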