🤖 AI Summary
Existing deepfake detection methods suffer from poor generalization, reliance on pretraining with authentic samples, and overemphasis on audio-visual inconsistency—leading to frequent failure in detecting intra-modal local artifacts. To address these limitations, we propose a pretraining-free, single-stage end-to-end framework that innovatively introduces future-frame prediction mechanisms—both unimodal and cross-modal—and a window-level attention module to jointly model inter-frame temporal anomalies and cross-modal interaction deviations. Our approach simultaneously performs whole-video forgery classification and precise temporal localization of forged segments. Evaluated across multiple cross-domain benchmark datasets, the method achieves significant improvements over current state-of-the-art approaches, demonstrating breakthrough performance in both generalization capability and temporal localization accuracy.
📝 Abstract
Recent multimodal deepfake detection methods designed for generalization conjecture that single-stage supervised training struggles to generalize across unseen manipulations and datasets. However, such approaches that target generalization require pretraining over real samples. Additionally, these methods primarily focus on detecting audio-visual inconsistencies and may overlook intra-modal artifacts causing them to fail against manipulations that preserve audio-visual alignment. To address these limitations, we propose a single-stage training framework that enhances generalization by incorporating next-frame prediction for both uni-modal and cross-modal features. Additionally, we introduce a window-level attention mechanism to capture discrepancies between predicted and actual frames, enabling the model to detect local artifacts around every frame, which is crucial for accurately classifying fully manipulated videos and effectively localizing deepfake segments in partially spoofed samples. Our model, evaluated on multiple benchmark datasets, demonstrates strong generalization and precise temporal localization.