🤖 AI Summary
The increasing photorealism of AI-generated faces undermines human annotation reliability, while existing supervised deepfake detection methods suffer severe performance degradation on unlabeled social media data due to distribution shift.
Method: This paper proposes an unsupervised deepfake detection framework based on a dual-path network that jointly integrates text-guided cross-domain vision–semantics alignment, curriculum-based pseudo-label optimization, and cross-domain knowledge distillation—thereby mitigating both distribution shift and catastrophic forgetting. Crucially, learnable prompts are employed to enable robust multimodal embedding alignment.
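To make the text-guided alignment idea concrete, the following is a minimal sketch (not the paper's implementation) of how learnable class prompts can score images: visual features from an encoder are compared against per-class text-prompt embeddings by cosine similarity, and a softmax over those similarities yields real/fake predictions. All names, shapes, and values here are hypothetical.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between each row of a and each row of b.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# Hypothetical setup: 4 images, 2 classes (real/fake), embedding dim 8.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(4, 8))    # stand-in for frozen visual-encoder features
prompt_emb = rng.normal(size=(2, 8))   # stand-in for learnable text-prompt embeddings

logits = cosine_sim(image_emb, prompt_emb)                            # (4, 2)
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)    # softmax over classes
pred = probs.argmax(axis=1)                                           # real/fake per image
```

In the full method, the prompt embeddings would be optimized jointly with the alignment objective so that visual and textual features land in a shared, domain-invariant space; here they are just random placeholders.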
Contribution/Results: To our knowledge, this is the first work achieving robust unsupervised detection under highly overlapping real/fake face distributions. Evaluated on 11 mainstream benchmarks, our method achieves an average accuracy gain of +6.3% over state-of-the-art approaches, significantly improving unlabeled-data utilization and generalization across domains.
📝 Abstract
Existing deepfake detection methods depend heavily on labeled training data. However, as AI-generated content becomes increasingly realistic, even human annotators struggle to distinguish deepfakes from authentic images, making the labeling process both time-consuming and less reliable. Consequently, there is a growing demand for approaches that can effectively exploit large-scale unlabeled data from online social networks. Unlike typical unsupervised learning tasks, where categories are distinct, AI-generated faces closely mimic real image distributions and share strong similarities with them, causing a performance drop in conventional strategies. In this paper, we introduce the Dual-Path Guidance Network (DPGNet) to tackle two key challenges: (1) bridging the domain gap between faces produced by different generation models, and (2) utilizing unlabeled image samples. The method features two core modules: text-guided cross-domain alignment, which uses learnable prompts to unify visual and textual embeddings into a domain-invariant feature space, and curriculum-driven pseudo-label generation, which dynamically exploits the more informative unlabeled samples. To prevent catastrophic forgetting, we also bridge domains via cross-domain knowledge distillation. Extensive experiments on 11 popular datasets show that DPGNet outperforms state-of-the-art approaches by 6.3%, highlighting its effectiveness in leveraging unlabeled data to address the annotation challenges posed by the increasing realism of deepfakes.
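One common way to realize curriculum-driven pseudo-labeling, sketched below under our own assumptions (the paper's exact schedule and thresholds are not specified here), is to start with a strict confidence threshold so only easy unlabeled samples receive pseudo-labels, then relax the threshold over training so harder samples enter later. The function name and threshold values are illustrative, not from the paper.

```python
import numpy as np

def select_pseudo_labels(probs, epoch, total_epochs, t_start=0.95, t_end=0.75):
    """Curriculum selection: linearly relax the confidence threshold from
    t_start (early epochs, easy samples only) to t_end (later epochs)."""
    frac = epoch / max(total_epochs - 1, 1)
    threshold = t_start + (t_end - t_start) * frac
    conf = probs.max(axis=1)                 # model confidence per sample
    mask = conf >= threshold                 # keep only confident predictions
    return np.flatnonzero(mask), probs.argmax(axis=1)[mask]

# Three unlabeled samples with decreasing prediction confidence.
probs = np.array([[0.98, 0.02],
                  [0.80, 0.20],
                  [0.55, 0.45]])
idx_early, lab_early = select_pseudo_labels(probs, epoch=0, total_epochs=10)
idx_late, lab_late = select_pseudo_labels(probs, epoch=9, total_epochs=10)
# Early training admits only the most confident sample; later epochs
# also admit the medium-confidence one, while the ambiguous sample stays out.
```

The low-confidence sample is never pseudo-labeled, which matters precisely because real and fake face distributions overlap heavily: forcing labels onto ambiguous samples would amplify noise instead of exploiting the unlabeled data.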