🤖 AI Summary
Traditional deepfake video detection methods rely on spatial-frequency analysis and struggle to model fine-grained, pixel-level temporal inconsistencies between frames. To address this, we propose a pixel-level temporal-frequency detection framework: for each pixel, we apply a 1D Fourier transform to its temporal intensity sequence to explicitly capture subtle temporal-frequency anomalies. We further design an attention-based proposal module and a joint spatiotemporal Transformer that fuse temporal-frequency features with local and global spatiotemporal context, enabling end-to-end forgery localization. Our method significantly improves sensitivity to covert temporal artifacts in naturally moving regions, such as lip-motion or blink desynchronization, while outperforming state-of-the-art spatial-frequency approaches on benchmarks including FaceForensics++ and DFDC. It also generalizes well across diverse generation methods and is robust to compression and quality degradation.
📝 Abstract
We introduce a deepfake video detection approach that exploits pixel-wise temporal inconsistencies, which traditional spatial frequency-based detectors often overlook. Such detectors represent temporal information merely by stacking spatial frequency spectra across frames, and therefore fail to detect temporal artifacts in the pixel plane. Our approach instead performs a 1D Fourier transform along the time axis for each pixel, extracting features that are highly sensitive to temporal inconsistencies, especially in areas prone to unnatural movements. To precisely locate regions containing these temporal artifacts, we introduce an attention proposal module trained in an end-to-end manner. Additionally, our joint transformer module integrates pixel-wise temporal frequency features with spatio-temporal context features, expanding the range of detectable forgery artifacts. Our framework represents a significant advancement in deepfake video detection, providing robust performance across diverse and challenging detection scenarios.
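The pixel-wise temporal transform described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `pixelwise_temporal_spectrum`, the grayscale `(T, H, W)` input layout, the per-pixel mean removal, and the use of magnitude spectra are all assumptions made for illustration. The attention proposal module and the joint transformer are not modeled here.

```python
import numpy as np


def pixelwise_temporal_spectrum(frames: np.ndarray) -> np.ndarray:
    """Hypothetical helper: per-pixel 1D FFT along the time axis.

    frames: grayscale clip of shape (T, H, W) (assumed layout).
    Returns magnitude spectra of shape (T // 2 + 1, H, W).
    """
    frames = np.asarray(frames, dtype=np.float64)
    # Assumption: subtract each pixel's temporal mean so the DC bin
    # does not dominate the spectrum.
    centered = frames - frames.mean(axis=0, keepdims=True)
    # Real-input FFT over time, computed independently for every pixel.
    spectrum = np.fft.rfft(centered, axis=0)
    return np.abs(spectrum)


# Toy clip: 16 frames of 4x4 pixels; one pixel flickers at 4 cycles per clip.
T, H, W = 16, 4, 4
t = np.arange(T)
clip = np.zeros((T, H, W))
clip[:, 1, 2] = np.sin(2 * np.pi * 4 * t / T)

mag = pixelwise_temporal_spectrum(clip)
peak_bin = int(np.argmax(mag[:, 1, 2]))  # frequency bin with the most energy
```

In this toy clip only the flickering pixel carries temporal-frequency energy, and it concentrates in the bin matching the flicker rate, while static pixels yield an all-zero spectrum. A real detector would feed such per-pixel spectra into learned modules rather than inspect peaks directly.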