Divide and Conquer: Multimodal Video Deepfake Detection via Cross-Modal Fusion and Localization

📅 2026-01-30
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the lack of effective fusion mechanisms in multimodal audio-visual deepfake detection and localization by proposing a two-stage divide-and-conquer framework. In the first stage, forgery detection and fine-grained tampering localization are performed independently within each modality (audio and visual). The second stage integrates these modality-specific outputs through a data-driven cross-modal score fusion strategy. By combining precise intra-modal localization with discriminative inter-modal cues, the approach is designed to improve system robustness and generalization. Evaluated on the DDL Challenge Track 2 test set, the method achieves an AUC of 0.87, an average precision (AP) of 0.55, an average recall (AR) of 0.23, and a composite score of 0.5528.
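
To make the first-stage localization idea concrete, the following is a minimal sketch of how per-frame manipulation scores from a single modality could be turned into time segments. It is not the authors' implementation: the function name `scores_to_segments`, the fixed `threshold`, and the `frame_duration` parameter are all assumptions introduced for illustration.

```python
from typing import List, Tuple


def scores_to_segments(
    frame_scores: List[float],
    threshold: float = 0.5,       # hypothetical decision threshold
    frame_duration: float = 0.04  # hypothetical seconds per frame (25 fps)
) -> List[Tuple[float, float]]:
    """Group consecutive frames scoring >= threshold into (start_sec, end_sec) segments."""
    segments: List[Tuple[float, float]] = []
    start = None  # index of the first frame in the currently open segment
    for i, score in enumerate(frame_scores):
        if score >= threshold and start is None:
            start = i  # open a new suspected-manipulation segment
        elif score < threshold and start is not None:
            # close the segment at the first frame that falls below the threshold
            segments.append((start * frame_duration, i * frame_duration))
            start = None
    if start is not None:
        # the sequence ended while a segment was still open
        segments.append((start * frame_duration, len(frame_scores) * frame_duration))
    return segments
```

For example, `scores_to_segments([0.1, 0.9, 0.8, 0.2, 0.7], frame_duration=1.0)` yields two segments, one covering frames 1 to 2 and one covering the final frame. A real system would likely smooth the scores or merge nearby segments before reporting them.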

📝 Abstract
This paper presents a system for detecting fake audio-visual content (i.e., video deepfakes), developed for Track 2 of the DDL Challenge. The proposed system employs a two-stage framework comprising unimodal detection and multimodal score fusion. Specifically, it incorporates an audio deepfake detection module and an audio localization module to analyze and pinpoint manipulated segments in the audio stream. In parallel, an image-based deepfake detection and localization module processes the visual modality. To leverage complementary information across modalities, we further propose a multimodal score fusion strategy that integrates the outputs of the audio and visual modules. Guided by a detailed analysis of the training and evaluation datasets, we explore and evaluate several score calculation and fusion strategies to improve system robustness. Overall, the final fusion-based system achieves an AUC of 0.87, an AP of 0.55, and an AR of 0.23 on the challenge test set, resulting in a final score of 0.5528.
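
The abstract does not state which fusion rule the final system uses, so the following is only a sketch of two common score-level fusion strategies of the kind it describes exploring. The function `fuse_scores`, its `audio_weight` parameter, and the `strategy` names are hypothetical and chosen for illustration.

```python
def fuse_scores(
    audio_score: float,
    visual_score: float,
    audio_weight: float = 0.5,   # hypothetical weight; would be tuned on held-out data
    strategy: str = "weighted",  # "weighted" or "max" (illustrative names)
) -> float:
    """Combine per-video fake probabilities from the audio and visual modules."""
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    if strategy == "weighted":
        # convex combination of the two unimodal scores
        return audio_weight * audio_score + (1.0 - audio_weight) * visual_score
    if strategy == "max":
        # flag the video if either modality looks manipulated
        return max(audio_score, visual_score)
    raise ValueError(f"unknown strategy: {strategy}")
```

The `max` rule reflects the observation that a video can be fake when only one stream has been tampered with, while the weighted rule lets a more reliable modality dominate. Which behaves better depends on the data, which is presumably why the authors base their choice on dataset analysis.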
Problem

Research questions and friction points this paper is trying to address.

deepfake detection
multimodal video
audio-visual forgery
manipulation localization
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal fusion
deepfake detection
cross-modal localization
audio-visual analysis
score fusion
Authors

Qingcao Li
The State Key Laboratory of Blockchain and Data Security, Zhejiang University; School of Cyber Science and Engineering, Nanjing University of Science and Technology

Miao He
The State Key Laboratory of Blockchain and Data Security, Zhejiang University

Liang Yi
Tongji University
Marine Geology, Geochronology, Climatic changes, Asian monsoon

Qing Wen
The State Key Laboratory of Blockchain and Data Security, Zhejiang University

Yitao Zhang
The State Key Laboratory of Blockchain and Data Security, Zhejiang University

Hongshuo Jin
The State Key Laboratory of Blockchain and Data Security, Zhejiang University

Peng Cheng
Zhejiang University
IoT, Acoustic Security and Privacy, Digital Signal Processing

Zhongjie Ba
Zhejiang University
IoT security

Li Lu
Research Professor (Tenure-track), College of Computer Science and Technology, Zhejiang University
Intelligent System Security, IoT Security, Ubiquitous Computing, Mobile Sensing

Kui Ren
Professor and Dean of Computer Science, Zhejiang University, ACM/IEEE Fellow
Data Security & Privacy, AI Security, IoT & Vehicular Security