Lightweight Joint Audio-Visual Deepfake Detection via Single-Stream Multi-Modal Learning Framework

📅 2025-06-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing audio-visual deepfake detection methods commonly adopt dual-branch architectures, leading to modality fragmentation, insufficient cross-modal feature fusion, and model redundancy. To address these limitations, this work proposes a single-stream multimodal learning framework that eliminates separate audio and visual subnetworks. Instead, it introduces a collaborative audio-visual learning module enabling continuous inter-layer cross-modal feature fusion, thereby enhancing content dependency modeling and robustness against modality mismatch. A lightweight multimodal classification module is further integrated to support end-to-end joint training. The resulting model contains only 0.48M parameters yet achieves state-of-the-art performance across DF-TIMIT, FakeAVCeleb, and DFDC benchmarks—outperforming prior methods on both unimodal and multimodal forgeries, as well as unseen attack types. This demonstrates superior efficiency and generalization capability.

📝 Abstract
Deepfakes are AI-synthesized multimedia data that may be abused to spread misinformation. Deepfake generation involves both visual and audio manipulation. To detect audio-visual deepfakes, previous studies commonly employ two relatively independent sub-models to learn audio and visual features, respectively, and fuse them afterwards for detection. However, this may underutilize the inherent correlations between audio and visual features. Moreover, two isolated feature-learning sub-models can introduce redundant neural layers, making the overall model inefficient and impractical for resource-constrained environments. In this work, we design a lightweight network for audio-visual deepfake detection via a single-stream multi-modal learning framework. Specifically, we introduce a collaborative audio-visual learning block that efficiently integrates multi-modal information while learning the visual and audio features. By iteratively employing this block, our single-stream network achieves continuous fusion of multi-modal features across its layers. Thus, the network captures visual and audio features without excessive block stacking, resulting in a lightweight design. Furthermore, we propose a multi-modal classification module that strengthens the dependence of the visual and audio classifiers on modality content, and enhances the overall resistance of the video classifier against mismatches between the audio and visual modalities. We conduct experiments on the DF-TIMIT, FakeAVCeleb, and DFDC benchmark datasets. Compared with state-of-the-art audio-visual joint detection methods, our method is significantly lighter, with only 0.48M parameters, yet it achieves superior performance on both uni-modal and multi-modal deepfakes, as well as on unseen types of deepfakes.
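The single-stream idea in the abstract — one set of layers serving both modalities, with cross-modal fusion happening at every layer rather than in a late fusion step — can be caricatured in a few lines of NumPy. Everything below (feature width, block count, the exact fusion rule, the classifier heads) is an illustrative assumption, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def collaborative_av_block(v, a, w_shared, w_cross):
    """One collaborative audio-visual block (illustrative sketch).

    A single shared projection serves both modalities (single-stream),
    and each modality receives a contribution from the other, so fusion
    happens continuously at every layer instead of once at the end.
    """
    v_new = relu(v @ w_shared + a @ w_cross)
    a_new = relu(a @ w_shared + v @ w_cross)
    return v_new, a_new

d = 32         # feature width (assumed, not from the paper)
n_blocks = 4   # number of stacked blocks (assumed)

# One shared weight pair reused by both modalities in each block:
# this sharing is what keeps the parameter count small.
weights = [(rng.normal(0, 0.1, (d, d)), rng.normal(0, 0.1, (d, d)))
           for _ in range(n_blocks)]

v = rng.normal(size=(1, d))  # visual feature for one clip (dummy input)
a = rng.normal(size=(1, d))  # audio feature for the same clip

for w_shared, w_cross in weights:
    v, a = collaborative_av_block(v, a, w_shared, w_cross)

# Lightweight multi-modal classification heads: per-modality scores
# plus a joint (video-level) score over the fused representation.
w_v = rng.normal(0, 0.1, (d, 1))
w_a = rng.normal(0, 0.1, (d, 1))
w_av = rng.normal(0, 0.1, (2 * d, 1))
score_visual = (v @ w_v).item()
score_audio = (a @ w_a).item()
score_video = (np.concatenate([v, a], axis=1) @ w_av).item()
print(score_visual, score_audio, score_video)
```

The separate visual, audio, and video heads mirror the abstract's multi-modal classification module, which ties each classifier to its modality content while the joint head covers audio-visual mismatch.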
Problem

Research questions and friction points this paper is trying to address.

Detects audio-visual deepfakes via single-stream multi-modal learning
Improves efficiency by reducing redundant neural layers
Enhances resistance to audio-visual mismatches in deepfakes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Single-stream multi-modal learning framework
Collaborative audio-visual learning block
Lightweight network with 0.48M parameters
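The 0.48M-parameter figure implies a very small deployment footprint, which is what makes the model practical for resource-constrained environments. A quick size check under common weight precisions (pure arithmetic from the reported parameter count; 1 MB taken as 10^6 bytes for simplicity):

```python
params = 0.48e6  # parameter count reported in the abstract

# Approximate weight storage at 32-bit and 16-bit float precision.
size_fp32_mb = params * 4 / 1e6
size_fp16_mb = params * 2 / 1e6

print(f"fp32: {size_fp32_mb:.2f} MB, fp16: {size_fp16_mb:.2f} MB")
# → fp32: 1.92 MB, fp16: 0.96 MB
```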
Kuiyuan Zhang
Computer Science and Technology, Harbin Institute of Technology Shenzhen, Shenzhen, China
Wenjie Pei
Computer Science and Technology, Harbin Institute of Technology Shenzhen, Shenzhen, China
Rushi Lan
Guilin University of Electronic Technology
image processing, pattern classification
Yifang Guo
Alibaba Group, Hangzhou, China
Zhongyun Hua
Professor, Harbin Institute of Technology, Shenzhen
Applied Cryptography, Trustworthy AI, Multimedia Security, Nonlinear Systems and Applications