Next-Frame Feature Prediction for Multimodal Deepfake Detection and Temporal Localization

📅 2025-11-13
🤖 AI Summary
Existing deepfake detection methods suffer from poor generalization, reliance on pretraining with authentic samples, and an overemphasis on audio-visual inconsistency, which leads to frequent failures on intra-modal local artifacts. To address these limitations, the authors propose a pretraining-free, single-stage, end-to-end framework that introduces next-frame feature prediction, both uni-modal and cross-modal, together with a window-level attention module to jointly model inter-frame temporal anomalies and cross-modal interaction deviations. The approach simultaneously performs whole-video forgery classification and temporal localization of forged segments. Evaluated on multiple cross-domain benchmark datasets, the method outperforms current state-of-the-art approaches in both generalization and temporal localization accuracy.

📝 Abstract
Recent multimodal deepfake detection methods designed for generalization conjecture that single-stage supervised training struggles to generalize across unseen manipulations and datasets. However, such approaches that target generalization require pretraining on real samples. Additionally, these methods primarily focus on detecting audio-visual inconsistencies and may overlook intra-modal artifacts, causing them to fail against manipulations that preserve audio-visual alignment. To address these limitations, we propose a single-stage training framework that enhances generalization by incorporating next-frame prediction for both uni-modal and cross-modal features. Additionally, we introduce a window-level attention mechanism to capture discrepancies between predicted and actual frames, enabling the model to detect local artifacts around every frame, which is crucial for accurately classifying fully manipulated videos and effectively localizing deepfake segments in partially spoofed samples. Our model, evaluated on multiple benchmark datasets, demonstrates strong generalization and precise temporal localization.
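The core idea of next-frame feature prediction can be illustrated with a minimal PyTorch sketch. This is an assumption-laden toy, not the paper's architecture: the predictor here is a simple MLP over per-frame features, and the per-frame discrepancy between predicted and actual features is what a downstream module would inspect for temporal anomalies. All names and dimensions are illustrative.

```python
import torch
import torch.nn as nn


class NextFramePredictor(nn.Module):
    """Hypothetical sketch: predict frame t+1's feature from frame t's feature.

    The paper's actual predictor (and its cross-modal variant) may differ;
    this only illustrates the prediction-then-discrepancy pattern.
    """

    def __init__(self, dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim),
            nn.GELU(),
            nn.Linear(dim, dim),
        )

    def forward(self, feats: torch.Tensor):
        # feats: (batch, T, dim) per-frame features from some backbone
        pred = self.net(feats[:, :-1])          # predictions for frames 1..T-1
        target = feats[:, 1:]                   # actual features of frames 1..T-1
        # Per-frame squared-error discrepancy; large values flag temporal anomalies
        disc = (pred - target).pow(2).mean(dim=-1)  # (batch, T-1)
        return pred, disc
```

A cross-modal variant would predict, e.g., the next visual feature from the current audio feature, so that manipulations breaking either intra-modal continuity or cross-modal consistency inflate the discrepancy signal.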
Problem

Research questions and friction points this paper is trying to address.

Detecting multimodal deepfakes with improved generalization across unseen manipulations
Addressing limitations of methods that overlook intra-modal artifacts in videos
Enabling precise temporal localization of deepfake segments in partially spoofed content
Innovation

Methods, ideas, or system contributions that make the work stand out.

Next-frame prediction for uni-modal and cross-modal features
Window-level attention to capture frame discrepancies
Single-stage training for enhanced generalization capability
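The window-level attention idea can be sketched as local self-attention in which each frame attends only to its temporal neighbors, so that discrepancies are aggregated around every frame rather than globally. This is an illustrative sketch assuming a standard `nn.MultiheadAttention` with a banded mask; the paper's module and window size are not specified here and all parameter names are assumptions.

```python
import torch
import torch.nn as nn


class WindowAttention(nn.Module):
    """Illustrative window-level attention: each frame attends only to
    frames within +/- win//2 of itself (not the paper's exact module)."""

    def __init__(self, dim: int = 256, heads: int = 4, win: int = 5):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.win = win

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, dim), e.g. per-frame discrepancy-augmented features
        T = x.size(1)
        idx = torch.arange(T, device=x.device)
        # Boolean mask: True entries are blocked (outside the local window)
        mask = (idx[None, :] - idx[:, None]).abs() > self.win // 2
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return out
```

Restricting attention to a local window keeps the model sensitive to artifacts around individual frames, which is what enables frame-level temporal localization of forged segments in partially spoofed videos.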
Ashutosh Anshul
College of Computing and Data Science, Nanyang Technological University, Singapore
Shreyas Gopal
College of Computing and Data Science, Nanyang Technological University, Singapore
Deepu Rajan
Nanyang Technological University
Image Processing, Computer Vision
E. Chng
College of Computing and Data Science, Nanyang Technological University, Singapore