🤖 AI Summary
This work addresses model provenance for face-swapping deepfake videos—specifically, fine-grained attribution to the generative model used, rather than binary fake detection. We propose a lightweight spatiotemporal modeling framework featuring a novel CNN backbone that jointly integrates spatial-temporal dual attention and multi-scale feature embedding. Trained end-to-end, it efficiently captures model-specific artifacts with high fidelity. Our method achieves a strong trade-off between accuracy and computational efficiency: on the DFDM, FaceForensics++, and FakeAVCeleb benchmarks, it outperforms state-of-the-art approaches by 3.2–5.7% in average classification accuracy and runs inference 2.1× faster, enabling real-time deployment in digital forensic applications.
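The two mechanisms named above can be illustrated with a small numerical sketch. This is not the paper's implementation: the shapes, pooling scales, and the specific attention formulations (sigmoid spatial gating, softmax temporal weighting, pyramid-style multi-scale pooling) are assumptions chosen to show the general idea of reweighting per-frame CNN features spatially and temporally, then collapsing them into one fixed-length clip descriptor.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def spatial_temporal_attention(feats):
    """feats: (T, H, W, C) per-frame CNN feature maps (hypothetical shapes).

    Spatial attention: a per-pixel gate derived from channel statistics.
    Temporal attention: a per-frame weight from pooled frame descriptors.
    """
    # Spatial gate: sigmoid over channel-averaged activations -> (T, H, W, 1)
    spatial = sigmoid(feats.mean(axis=-1, keepdims=True))
    feats = feats * spatial
    # Temporal weights: softmax over global-average-pooled frame energy -> (T,)
    frame_desc = feats.mean(axis=(1, 2, 3))
    temporal = softmax(frame_desc)
    return feats * temporal[:, None, None, None]

def multiscale_embedding(feats, scales=(1, 2, 4)):
    """Average-pool the attended features at several grid sizes and
    concatenate, yielding one fixed-length clip descriptor of
    length C * sum(s**2 for s in scales)."""
    T, H, W, C = feats.shape
    clip = feats.sum(axis=0)  # temporal aggregation -> (H, W, C)
    pooled = []
    for s in scales:
        hs, ws = H // s, W // s
        # Average within each of the s x s grid cells
        grid = clip[:hs * s, :ws * s].reshape(s, hs, s, ws, C).mean(axis=(1, 3))
        pooled.append(grid.reshape(-1))
    return np.concatenate(pooled)

# Toy usage: a 16-frame clip with 8x8 feature maps and 8 channels
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8, 8, 8))
emb = multiscale_embedding(spatial_temporal_attention(x))
print(emb.shape)  # (168,) = 8 channels * (1 + 4 + 16) grid cells
```

In an actual trained model the spatial and temporal weights would be produced by learned layers rather than fixed statistics, but the data flow (gate per pixel, weight per frame, pool at multiple scales, concatenate) is the same.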
📝 Abstract
The widespread emergence of face-swap Deepfake videos poses growing risks to digital security, privacy, and media integrity, necessitating effective forensic tools for identifying the source of such manipulations. While prior research has focused primarily on binary Deepfake detection, the task of model attribution -- determining which generative model produced a given Deepfake -- remains underexplored. In this paper, we introduce FAME (Fake Attribution via Multilevel Embeddings), a lightweight and efficient spatio-temporal framework designed to capture subtle generative artifacts specific to different face-swap models. FAME integrates spatial and temporal attention mechanisms to improve attribution accuracy while remaining computationally efficient. We evaluate our model on three challenging and diverse datasets: Deepfake Detection and Manipulation (DFDM), FaceForensics++, and FakeAVCeleb. Results show that FAME consistently outperforms existing methods in both accuracy and runtime, highlighting its potential for deployment in real-world forensic and information security applications.