MSCT: Differential Cross-Modal Attention for Deepfake Detection

📅 2026-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of conventional multimodal deepfake detection methods, particularly insufficient feature extraction and misalignment between modalities, by proposing a multiscale cross-modal Transformer-based detection framework. The approach integrates a multiscale self-attention mechanism to capture both local and global forgery cues and introduces a differential cross-modal attention module to align and fuse audio-visual features, thereby enhancing the discriminability of forgery traces. Experimental results on the FakeAVCeleb dataset show that the proposed method improves detection performance, indicating its effectiveness for multimodal deepfake identification.
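The multiscale self-attention described above can be illustrated with a minimal sketch. The paper's exact design is not given in this summary, so the following is an assumption: queries stay at full temporal resolution while keys and values are average-pooled over windows of adjacent embeddings at several scales, letting each token attend both to individual neighbours and to coarser local summaries. The function names and the choice of averaging across scales are illustrative, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Standard scaled dot-product attention.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def pool(x, scale):
    # Merge each window of `scale` adjacent embeddings by averaging.
    T, D = x.shape
    T_trim = (T // scale) * scale
    return x[:T_trim].reshape(T_trim // scale, scale, D).mean(axis=1)

def multiscale_self_attention(x, scales=(1, 2, 4)):
    # Hypothetical multiscale variant: full-resolution queries attend to
    # keys/values pooled at each scale; the per-scale outputs are averaged.
    outs = [attention(x, pool(x, s), pool(x, s)) for s in scales]
    return np.mean(outs, axis=0)

x = np.random.default_rng(0).normal(size=(8, 16))  # 8 tokens, dim 16
y = multiscale_self_attention(x)
print(y.shape)  # (8, 16)
```

With scale 1 the pooled sequence is the input itself, so ordinary self-attention is recovered as a special case; larger scales inject increasingly global context, which matches the stated goal of capturing both local and global forgery cues.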
📝 Abstract
Audio-visual deepfake detection typically employs complementary multi-modal models to check for forgery traces in a video. These methods primarily extract forgery traces through audio-visual alignment, exploiting the inconsistency between the audio and video modalities. However, traditional multi-modal forgery detection methods suffer from insufficient feature extraction and modal alignment deviation. To address this, we propose a multi-scale cross-modal transformer encoder (MSCT) for deepfake detection. Our approach includes a multi-scale self-attention to integrate the features of adjacent embeddings and a differential cross-modal attention to fuse multi-modal features. Our experiments demonstrate competitive performance on the FakeAVCeleb dataset, validating the effectiveness of the proposed structure.
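The abstract does not spell out the differential cross-modal attention module, so the sketch below is a guess at one plausible reading: one modality's features act as queries against the other modality's keys and values, and the "differential" part is taken here to be the residual difference between the cross-attended features and the querying features, which would emphasize audio-visual mismatch. All names and the subtraction-based formulation are assumptions, not the authors' definition.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Standard scaled dot-product attention.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def differential_cross_modal_attention(audio, visual):
    # Audio features query the visual stream (cross-modal attention) ...
    cross = attention(audio, visual, visual)
    # ... and the assumed "differential" step subtracts the audio features,
    # so the output is large where the two modalities disagree and small
    # where they are consistent.
    return cross - audio

rng = np.random.default_rng(1)
audio = rng.normal(size=(8, 16))   # 8 audio frames, dim 16
visual = rng.normal(size=(12, 16)) # 12 video frames, same embedding dim
diff = differential_cross_modal_attention(audio, visual)
print(diff.shape)  # (8, 16)
```

For genuine videos, where audio and lip motion agree, such a difference signal would be small; for deepfakes it would carry the misalignment cue, which is consistent with the abstract's framing of forgery traces as audio-visual inconsistency.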
Problem

Research questions and friction points this paper is trying to address.

deepfake detection
multi-modal
feature extraction
modal alignment
audio-visual inconsistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-scale self-attention
differential cross-modal attention
cross-modal transformer
deepfake detection
audio-visual alignment
Fangda Wei
Beijing Institute of Technology, China
Miao Liu
Beijing Institute of Technology, China
Yingxue Wang
China Academy of Electronics and Information Technology, China
Jing Wang
Beijing Institute of Technology
speech and audio signal processing, multimedia communication, virtual reality, tensor analysis, etc.
Shenghui Zhao
Beijing Institute of Technology, China
Nan Li
China Academy of Electronics and Information Technology
Artificial Intelligence, Data Science