🤖 AI Summary
This work addresses two limitations of conventional multimodal deepfake detection methods, insufficient feature extraction and misalignment between modalities, by proposing a multi-scale cross-modal Transformer-based detection framework. The approach integrates a multi-scale self-attention mechanism to capture both local and global forgery cues and introduces a differential cross-modal attention module to align and fuse audio-visual features, thereby enhancing the discriminability of forgery traces. Experimental results on the FakeAVCeleb dataset show competitive detection performance, supporting the effectiveness of the proposed design for multimodal deepfake identification.
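The multi-scale self-attention mechanism is described only at a high level, so the sketch below is one plausible reading: attention is computed against key/value sequences built by pooling adjacent token embeddings at several window sizes, so each query sees both fine-grained and coarser context. The window sizes, the average-pooling choice, and the concatenation-then-projection step are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleSelfAttention(nn.Module):
    """Self-attention over token sequences pooled at several scales.

    Hypothetical sketch: keys/values come from average-pooled groups of
    adjacent embeddings (window sizes are illustrative), so each query
    attends to both local and more global context.
    """

    def __init__(self, dim: int, num_heads: int = 4, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim * len(scales), dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        outputs = []
        for s in self.scales:
            if s == 1:
                kv = x
            else:
                # Pool adjacent embeddings to form a coarser sequence.
                kv = F.avg_pool1d(x.transpose(1, 2), kernel_size=s, stride=s)
                kv = kv.transpose(1, 2)
            out, _ = self.attn(query=x, key=kv, value=kv)
            outputs.append(out)
        # Concatenate per-scale outputs and project back to the model width.
        return self.proj(torch.cat(outputs, dim=-1))


# Usage example: a batch of 2 sequences of 16 tokens with width 128.
x = torch.randn(2, 16, 128)
print(MultiScaleSelfAttention(128)(x).shape)  # torch.Size([2, 16, 128])
```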
📝 Abstract
Audio-visual deepfake detection typically relies on a complementary multi-modal model to uncover forgery traces in a video. Such methods primarily extract forgery traces through audio-visual alignment, exploiting inconsistencies between the audio and video modalities. However, traditional multi-modal forgery detection methods suffer from insufficient feature extraction and modal alignment deviation. To address this, we propose a multi-scale cross-modal transformer encoder (MSCT) for deepfake detection. Our approach includes a multi-scale self-attention mechanism that integrates the features of adjacent embeddings and a differential cross-modal attention module that fuses multi-modal features. Our experiments demonstrate competitive performance on the FakeAVCeleb dataset, validating the effectiveness of the proposed structure.
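The abstract does not spell out how the differential cross-modal attention is formed, so the following is a minimal sketch under one assumption: each modality cross-attends to the other, and the difference between the cross-attended features and the modality's own features serves as an audio-visual inconsistency cue before fusion. The subtraction-based cue, the temporal mean-pooling, and the concatenation fusion are assumptions introduced for illustration only.

```python
import torch
import torch.nn as nn


class DifferentialCrossModalAttention(nn.Module):
    """Fuses audio and visual token sequences via cross-modal attention.

    Hypothetical sketch: the deviation between cross-attended features and
    the original modality features is treated as a misalignment signal;
    this is an assumption, not the paper's confirmed formulation.
    """

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.v_from_a = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.a_from_v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, video: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # video: (batch, Tv, dim), audio: (batch, Ta, dim)
        v_attn, _ = self.v_from_a(query=video, key=audio, value=audio)
        a_attn, _ = self.a_from_v(query=audio, key=video, value=video)
        # Differential cue: how far the cross-attended view deviates from the
        # modality's own representation (large deviations suggest misalignment).
        v_diff = v_attn - video
        a_diff = a_attn - audio
        # Pool over time and fuse both modalities into a single vector.
        fused = torch.cat([v_diff.mean(dim=1), a_diff.mean(dim=1)], dim=-1)
        return self.fuse(fused)  # (batch, dim)


# Usage example with mismatched audio/video sequence lengths.
video = torch.randn(2, 16, 128)
audio = torch.randn(2, 32, 128)
print(DifferentialCrossModalAttention(128)(video, audio).shape)  # torch.Size([2, 128])
```

The fused vector would then feed a real/fake classification head; that final stage is standard and omitted here.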