🤖 AI Summary
Existing multimodal deepfake detection methods struggle to simultaneously capture modality-specific artifacts (e.g., facial swapping traces, spectral distortions) and cross-modal semantic misalignment (e.g., audio-visual desynchronization).
Method: We propose an Alignment–Distillation Collaborative Modeling framework: (i) a dual-stream encoder extracts audio and visual features; (ii) a contrastive learning–driven cross-modal alignment module models high-level semantic synchrony; and (iii) a knowledge distillation mechanism decouples and preserves fine-grained modality-specific cues during feature fusion.
Contribution/Results: This is the first work to jointly leverage cross-modal alignment and knowledge distillation for deepfake detection, unifying low-level artifact modeling and high-level semantic inconsistency detection. Our method achieves state-of-the-art performance on both multimodal (e.g., FakeAVCeleb, DF-TIMIT) and unimodal (e.g., FaceForensics++, Celeb-DF) benchmarks, with significant gains in accuracy (+2.1–4.7%) and cross-domain generalization. Results validate the efficacy of synergistically modeling complementary multimodal cues.
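The contrastive alignment module described above can be illustrated with a minimal sketch. This is not the paper's actual implementation; it shows a standard symmetric InfoNCE objective (the usual choice for contrastive cross-modal alignment) that pulls matched audio/visual clip embeddings together and pushes mismatched pairs apart. The function name `alignment_loss` and the temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def alignment_loss(audio_emb: torch.Tensor,
                   visual_emb: torch.Tensor,
                   temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired audio/visual embeddings.

    audio_emb, visual_emb: (B, D) outputs of the dual-stream encoders,
    where row i of each tensor comes from the same clip.
    """
    # L2-normalize so the dot product is cosine similarity.
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)

    # (B, B) similarity matrix; diagonal entries are the matched pairs.
    logits = a @ v.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)

    # Average the audio-to-visual and visual-to-audio directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

On genuine videos the matched pairs score high and the loss is low; desynchronized (manipulated) audio-visual pairs break this structure, which is the semantic-synchrony signal the alignment module exploits.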
📝 Abstract
The rapid emergence of multimodal deepfakes, in which visual and auditory content are manipulated in concert, undermines the reliability of existing detectors that rely solely on modality-specific artifacts or cross-modal inconsistencies. In this work, we first demonstrate that modality-specific forensic traces (e.g., face-swap artifacts or spectral distortions) and modality-shared semantic misalignments (e.g., lip-speech asynchrony) offer complementary evidence, and that neglecting either aspect limits detection performance. Existing approaches either naively fuse modality-specific features without reconciling their conflicting characteristics or focus predominantly on semantic misalignment at the expense of fine-grained modality-specific artifact cues. To address these shortcomings, we propose a general multimodal framework for video deepfake detection via Cross-Modal Alignment and Distillation (CAD). CAD comprises two core components: 1) cross-modal alignment, which identifies inconsistencies in high-level semantic synchronization (e.g., lip-speech mismatches); and 2) cross-modal distillation, which mitigates feature conflicts during fusion while preserving modality-specific forensic traces (e.g., spectral distortions in synthetic audio). Extensive experiments on both multimodal and unimodal (e.g., image-only/video-only) deepfake benchmarks demonstrate that CAD significantly outperforms previous methods, validating the necessity of harmoniously integrating complementary multimodal information.
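The cross-modal distillation component can likewise be sketched in a hedged form. The snippet below is not the paper's implementation; it shows the standard soft-label knowledge-distillation loss (Hinton-style KL divergence between temperature-softened distributions), one common way a fused "student" representation can be trained to retain the modality-specific cues captured by a unimodal "teacher" branch. The function name `distillation_loss` and the temperature value are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      T: float = 4.0) -> torch.Tensor:
    """KL divergence between temperature-softened teacher and student outputs.

    Softening with T > 1 exposes the teacher's fine-grained preferences
    (here, a stand-in for modality-specific forensic cues) that hard
    labels would discard; the T*T factor keeps gradient magnitudes
    comparable across temperatures.
    """
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T
```

In a framework like CAD, one teacher per modality (e.g., a frozen audio-artifact detector) could supervise the fused student so that fusion does not wash out spectral or facial forensic traces; the exact teacher/student pairing here is an assumption, not the paper's specification.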