🤖 AI Summary
This work addresses the limited generalization of existing deepfake detection methods against unseen generation techniques, such as Sora2. The authors propose a fully self-supervised, audio-driven personalized facial expression diffusion model that requires no real or fake video training data. By reconstructing facial expression sequences and computing an identity distance derived from diffusion reconstruction errors, the method enables zero-shot deepfake detection for specific individuals. This approach pioneers the integration of personalized audio-expression diffusion modeling into forgery detection, demonstrating strong generalization and robustness. Evaluated on four benchmarks—DF-TIMIT, DFDCP, KoDF, and IDForge—it achieves an average AUC improvement of 4.22 percentage points over state-of-the-art methods, effectively detects Sora2-generated videos, and exhibits high resilience to common perturbations such as compression and blurring.
📝 Abstract
Detecting unknown deepfake manipulations remains one of the most challenging problems in face forgery detection. Current state-of-the-art approaches fail to generalize to unseen manipulations, as they primarily rely on supervised training with existing deepfakes or pseudo-fakes, which leads to overfitting to specific forgery patterns. In contrast, self-supervised methods offer greater potential for generalization, but existing work struggles to learn discriminative representations from self-supervision alone. In this paper, we propose ExposeAnyone, a fully self-supervised approach based on a diffusion model that generates expression sequences from audio. The key idea is that, once the model is personalized to specific subjects using reference sets, it can compute identity distances between suspected videos and the personalized subjects via diffusion reconstruction errors, enabling person-of-interest face forgery detection. Extensive experiments demonstrate that 1) our method outperforms the previous state-of-the-art method by 4.22 percentage points in average AUC on the DF-TIMIT, DFDCP, KoDF, and IDForge datasets, 2) our model is also capable of detecting Sora2-generated videos, where previous approaches perform poorly, and 3) our method is highly robust to corruptions such as blur and compression, highlighting its applicability to real-world face forgery detection.
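The abstract's detection criterion, an identity distance derived from diffusion reconstruction errors, can be illustrated with a minimal sketch. The functions, the `denoise_fn` callable standing in for the personalized diffusion model, and the single fixed noise level are all hypothetical simplifications not taken from the paper; the intent is only to show the logic of noising an expression sequence, denoising it with a subject-personalized model, and treating a large average reconstruction error as evidence the video does not match the personalized subject.

```python
import numpy as np

def diffusion_recon_error(denoise_fn, x, noise_level=0.1, seed=0):
    """Noise an expression-sequence window, denoise it with the
    (hypothetical) personalized model, and return the mean squared
    reconstruction error against the original window."""
    rng = np.random.default_rng(seed)
    noised = x + noise_level * rng.standard_normal(x.shape)
    recon = denoise_fn(noised)
    return float(np.mean((recon - x) ** 2))

def identity_distance(denoise_fn, windows, noise_level=0.1):
    """Average reconstruction error over a clip's windows.

    A diffusion model personalized to the true subject should
    reconstruct that subject's expression dynamics well, so a large
    distance suggests the clip does not match the person of interest.
    A decision threshold would be calibrated on the reference set.
    """
    errors = [diffusion_recon_error(denoise_fn, w, noise_level, seed=i)
              for i, w in enumerate(windows)]
    return float(np.mean(errors))
```

Because the reference set contains only genuine footage of the subject, this scheme needs no real or fake training videos, which is what makes the zero-shot, person-of-interest framing possible.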