Referee: Reference-aware Audiovisual Deepfake Detection

📅 2025-10-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing audio-visual deepfake detection methods suffer from poor generalization, especially against unseen generative models. To address this, we propose a reference-aware cross-modal identity verification framework that requires only a single reference sample to extract speaker-specific cues. Our approach jointly models audio-visual synchrony and identity consistency, moving beyond conventional reliance on spatiotemporal artifacts. We introduce two key innovations: (i) a reference-aware mechanism that grounds verification in speaker identity, and (ii) cross-modal feature alignment that integrates identity-relevant query matching with synchrony reasoning for fine-grained identity consistency analysis. Extensive experiments demonstrate state-of-the-art cross-dataset and cross-lingual performance on FakeAVCeleb, FaceForensics++, and KoDF. Our method significantly improves generalization to unknown forgery types and robustness under distribution shifts, establishing a new paradigm for identity-centric deepfake detection.

📝 Abstract
As deepfakes generated by advanced generative models pose increasingly serious threats, existing audiovisual deepfake detection approaches struggle to generalize to unseen forgeries. We propose a novel reference-aware audiovisual deepfake detection method, called Referee. Speaker-specific cues from only one-shot examples are leveraged to detect manipulations beyond spatiotemporal artifacts. By matching and aligning identity-related queries from reference and target content into cross-modal features, Referee jointly reasons about audiovisual synchrony and identity consistency. Extensive experiments on FakeAVCeleb, FaceForensics++, and KoDF demonstrate that Referee achieves state-of-the-art performance on cross-dataset and cross-language evaluation protocols. Experimental results highlight the importance of cross-modal identity verification for future deepfake detection. The code is available at https://github.com/ewha-mmai/referee.

Problem

Research questions and friction points this paper is trying to address.

Detecting unseen audiovisual deepfakes using one-shot reference examples
Matching identity queries across reference and target cross-modal features
Jointly reasoning about audiovisual synchrony and identity consistency for detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses one-shot reference examples for speaker-specific cues
Matches identity queries across reference and target content
Jointly reasons about audiovisual synchrony and identity consistency
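The identity-centric idea above can be illustrated with a minimal sketch: compare per-frame target embeddings against a one-shot reference identity embedding, and combine that identity-consistency score with a face/voice synchrony score. This is a hypothetical simplification (the function name, embedding shapes, and the linear fusion weight `w_sync` are assumptions for illustration), not the paper's actual query-matching architecture.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize vectors so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def referee_style_score(ref_id, tgt_face, tgt_voice, w_sync=0.5):
    """Hypothetical fusion of identity consistency and AV synchrony.

    ref_id:    (d,)   identity embedding from the one-shot reference
    tgt_face:  (T, d) per-frame face embeddings of the target clip
    tgt_voice: (T, d) per-frame voice embeddings of the target clip
    Returns a score in [-1, 1]; higher suggests the clip is genuine.
    """
    ref = l2_normalize(ref_id)
    face = l2_normalize(tgt_face)
    voice = l2_normalize(tgt_voice)
    # Identity consistency: how well target faces match the reference identity.
    id_score = float(np.mean(face @ ref))
    # Audiovisual synchrony: per-frame face/voice agreement in a shared space.
    sync_score = float(np.mean(np.sum(face * voice, axis=-1)))
    return (1.0 - w_sync) * id_score + w_sync * sync_score
```

In this toy form, a clip whose faces match the reference identity and stay synchronized with the voice scores higher than a mismatched or desynchronized one; the actual method replaces these cosine comparisons with learned cross-modal query matching.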
Hyemin Boo
Ewha Womans University, Republic of Korea
Eunsang Lee
Ewha Womans University, Republic of Korea
Jiyoung Lee
Assistant Professor, Ewha Womans University
Multimodal Learning · Computer Vision · Machine Learning