🤖 AI Summary
This work addresses the limited generalizability of existing audio-visual deepfake detection methods, which often rely on dataset-specific artifacts. To overcome this, the authors propose a multimodal framework that jointly performs deepfake detection and generator attribution, using generator provenance as a structured regularizer. A cross-modal forensic fingerprint consistency (CMFFC) loss guides the model toward discriminative features tied to the underlying generation mechanisms. The video branch employs a ResNet50 backbone enhanced with temporal attention, while the audio branch uses a pretrained ResNet18 to process mel spectrograms. Evaluated on FakeAVCeleb, the method achieves 99.7% balanced accuracy and 99.8% AUC for detection, along with 95.9% generator attribution accuracy. Cross-dataset evaluation on DeepfakeTIMIT, DFDM, and LAV-DF shows that detection of real videos generalizes robustly, while detection of fakes from unseen generators remains an open challenge.
📝 Abstract
Audio-visual deepfakes have reached a level of realism that makes perceptual detection unreliable, threatening media integrity and biometric security. While multimodal detection has shown promise, most approaches treat the problem as binary classification and often latch onto dataset-specific artifacts rather than genuine generative traces. We argue that a detector incapable of identifying how a video was forged is likely learning the wrong signal. Unlike binary detection, attribution-guided learning imposes a stronger geometric constraint on the shared embedding space, forcing the model to encode generator-specific forensic content rather than shortcuts.
We propose the Attribution-Guided Multimodal Deepfake Detection (AMDD) framework, which jointly learns to detect and attribute manipulation. AMDD treats generator attribution as a structured regularizer that constrains representation geometry toward forensically meaningful features. We introduce a Cross-Modal Forensic Fingerprint Consistency (CMFFC) loss to enforce alignment between generator-induced artifacts in the visual and audio streams. The loss exploits the fact that coherent manipulation leaves correlated traces across modalities, grounded in the physical coupling between speech and facial articulation that synthetic pipelines routinely disrupt.
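The abstract does not give the form of the joint objective, so the following is only a minimal sketch of how a detection loss, an attribution loss, and a cross-modal consistency term might be combined. Everything in it is an assumption made for illustration: the function names, the weights `lambda_attr` and `lambda_cons`, and in particular the use of cosine distance between visual and audio fingerprint embeddings as a stand-in for the CMFFC loss, whose actual definition may differ.

```python
import math


def softmax_cross_entropy(logits, target):
    """Cross-entropy for one sample, computed stably from raw logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def joint_loss(det_logits, is_fake, attr_logits, generator_id,
               visual_fp, audio_fp, lambda_attr=1.0, lambda_cons=0.5):
    """Hypothetical multi-task objective for a single sample.

    det_logits   : [real, fake] scores from the detection head
    attr_logits  : one score per known generator (attribution head)
    visual_fp / audio_fp : per-modality fingerprint embeddings
    The weights and the cosine-distance term are illustrative assumptions,
    not values or definitions taken from the paper.
    """
    l_det = softmax_cross_entropy(det_logits, is_fake)
    l_attr = softmax_cross_entropy(attr_logits, generator_id)
    # Assumed stand-in for CMFFC: penalize misaligned fingerprints.
    l_cons = 1.0 - cosine_similarity(visual_fp, audio_fp)
    return l_det + lambda_attr * l_attr + lambda_cons * l_cons


# Same predictions, different fingerprint alignment across modalities.
aligned = joint_loss([0.0, 0.0], 0, [0.0, 0.0, 0.0], 0, [1.0, 0.0], [1.0, 0.0])
opposed = joint_loss([0.0, 0.0], 0, [0.0, 0.0, 0.0], 0, [1.0, 0.0], [-1.0, 0.0])
print(aligned < opposed)
```

In this sketch the attribution term acts as the regularizer the abstract describes: the shared representation must separate generators, not only real from fake, and the consistency term adds a penalty whenever the two modalities disagree about the fingerprint.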
Architecturally, the visual encoder is a ResNet50 with temporal attention and the audio encoder is a pretrained ResNet18 operating on mel spectrograms, closing the encoder capacity gap found in prior models. On FakeAVCeleb, AMDD achieves 99.7% balanced accuracy and 99.8% AUC with 95.9% attribution accuracy. Cross-dataset evaluation on DeepfakeTIMIT, DFDM, and LAV-DF confirms that detection of real videos generalizes robustly, while detection of fakes from unseen generators remains an open challenge that we analyze in depth.
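Balanced accuracy is the mean of per-class recall, which is why it is informative when real and fake clips are unevenly represented and why it pairs naturally with the per-class view taken in the cross-dataset analysis. The sketch below shows the computation on toy labels; the data is invented for illustration and is not drawn from the paper's results, and the helper names are hypothetical.

```python
def per_class_recall(y_true, y_pred):
    """Recall for each class present in y_true (0 = real, 1 = fake)."""
    recalls = {}
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls[cls] = correct / len(idx)
    return recalls


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of the per-class recalls."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


def accuracy(y_true, y_pred):
    """Plain fraction of correct predictions."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Toy example: 2 real clips, 8 fake clips, and a degenerate detector
# that labels every clip as fake.
y_true = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
y_pred = [1] * 10

print(accuracy(y_true, y_pred))           # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.5
print(per_class_recall(y_true, y_pred))   # {0: 0.0, 1: 1.0}
```

Plain accuracy rewards the degenerate detector, while balanced accuracy exposes it, and the per-class recalls show which side fails. The same breakdown separates "real detection generalizes" from "fake detection on unseen generators does not".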