🤖 AI Summary
Existing zero-shot deepfake attribution methods suffer from poor cross-generator generalization and underuse multimodal cues. Method: We propose a bi-modal guided multi-perspective representation learning (BMRL) framework that jointly models visual (image, noise, and edge), facial parsing, and textual modalities, introducing a facial parsing encoder and a language encoder that collaboratively guide visual forgery representation learning, together with a novel deepfake attribution contrastive center (DFACC) loss that enhances inter-class separability and intra-class compactness. Contribution/Results: Extensive experiments under diverse zero-shot evaluation protocols demonstrate significant gains in attribution accuracy on unseen generators, consistently outperforming state-of-the-art methods. Our approach overcomes the limitations of unimodal modeling, enabling fine-grained and robust cross-generator deepfake attribution.
📝 Abstract
Tracing the source attribution of forged faces has gained significant attention due to the rapid advancement of generative models. However, existing deepfake attribution (DFA) works focus primarily on interactions among various domains within the vision modality, while other modalities, such as text and face parsing, remain underexplored. Moreover, they fail to assess the generalization of deepfake attributors to unseen generators in a fine-grained manner. In this paper, we propose a novel bi-modal guided multi-perspective representation learning (BMRL) framework for zero-shot deepfake attribution (ZS-DFA), which enables effective traceability to unseen generators. Specifically, we design a multi-perspective visual encoder (MPVE) to explore general deepfake attribution visual characteristics across three views (i.e., image, noise, and edge). We devise a novel parsing encoder that captures global face attribute embeddings, enabling parsing-guided DFA representation learning via vision-parsing matching. A language encoder is proposed to capture fine-grained language embeddings, facilitating language-guided general visual forgery representation learning through vision-language alignment. Additionally, we present a novel deepfake attribution contrastive center (DFACC) loss that pulls relevant generators closer and pushes irrelevant ones apart, and that can be incorporated into DFA models to enhance traceability. Experimental results demonstrate that our method outperforms the state of the art on the ZS-DFA task under various evaluation protocols.
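The abstract does not specify the form of the vision-language alignment objective. A common instantiation of such matching is a symmetric InfoNCE (CLIP-style) contrastive loss over paired visual and text embeddings; the sketch below is a minimal PyTorch illustration under that assumption, and the function name and temperature value are illustrative rather than taken from the paper.

```python
import torch
import torch.nn.functional as F

def vision_language_alignment_loss(img_emb: torch.Tensor,
                                   txt_emb: torch.Tensor,
                                   temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE alignment over a batch of paired embeddings.

    This is an assumed stand-in for the paper's vision-language alignment;
    the actual objective may differ.
    """
    # Project both modalities onto the unit sphere so the dot product
    # is a cosine similarity.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    # (B, B) similarity matrix between all image/text pairs in the batch.
    logits = img_emb @ txt_emb.t() / temperature
    # Matched pairs lie on the diagonal; treat alignment as classification
    # in both directions (image-to-text and text-to-image).
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

The same symmetric form could in principle serve the vision-parsing matching term by substituting parsing embeddings for `txt_emb`, though the paper may use a different matching criterion there.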
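Likewise, the exact DFACC formulation is not given in the abstract. Its stated behavior (pulling relevant generators closer, pushing irrelevant ones apart, and improving inter-class separability and intra-class compactness) resembles the classic contrastive-center loss, which the following hypothetical sketch adapts to generator labels; the class name, `delta` constant, and overall form are assumptions, not the paper's definition.

```python
import torch
import torch.nn as nn

class ContrastiveCenterLoss(nn.Module):
    """A minimal contrastive-center-style sketch of a DFACC-like loss.

    Each embedding is pulled toward a learnable center for its own
    generator class and pushed away from the centers of all other
    generator classes.
    """
    def __init__(self, num_classes: int, feat_dim: int, delta: float = 1e-6):
        super().__init__()
        # One learnable center per (seen) generator class.
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.delta = delta  # small constant to avoid division by zero

    def forward(self, feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Squared Euclidean distance from every embedding to every
        # class center: shape (batch, num_classes).
        dists = torch.cdist(feats, self.centers).pow(2)
        # Distance to the embedding's own generator center (intra-class term).
        intra = dists.gather(1, labels.unsqueeze(1)).squeeze(1)
        # Summed distance to all other centers (inter-class term).
        inter = dists.sum(dim=1) - intra
        # Small intra / large inter => compact classes, separated centers.
        return 0.5 * (intra / (inter + self.delta)).mean()
```

In training, a term like this would typically be added to the attribution cross-entropy loss with a weighting coefficient, with the centers updated jointly with the encoder.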