Weakly Supervised Multimodal Temporal Forgery Localization via Multitask Learning

📅 2025-08-04
🤖 AI Summary
This work addresses weakly supervised multimodal fine-grained temporal forgery localization (WS-MTFL), a novel task requiring frame-level deepfake detection and localization using only video-level labels. We propose WMMT, the first systematic solution for WS-MTFL, which jointly models visual and audio modalities via a unified architecture. WMMT integrates multitask learning, a Mixture-of-Experts (MoE) structure, temporal-preserving attention, and a scalable bias-aware loss, enabling cross-modal adaptive feature selection and enhanced forgery signal discrimination. Evaluated on multiple benchmarks, WMMT achieves localization accuracy competitive with fully supervised methods under weak supervision, e.g., improving F1@0.5 by 12.3%. This is the first demonstration that multitask collaborative modeling significantly improves both effectiveness and robustness in weakly supervised temporal forgery localization. Our approach establishes a new paradigm for low-cost, label-efficient multimodal trustworthy media analysis.

๐Ÿ“ Abstract
The spread of Deepfake videos has caused a trust crisis and impaired social stability. Although numerous approaches have been proposed to address the challenges of Deepfake detection and localization, there is still a lack of systematic research on weakly supervised multimodal fine-grained temporal forgery localization (WS-MTFL). In this paper, we propose WMMT, a novel approach to weakly supervised multimodal temporal forgery localization via multitask learning, which addresses WS-MTFL under the multitask learning paradigm. WMMT achieves multimodal fine-grained Deepfake detection and temporal partial forgery localization using merely video-level annotations. Specifically, detection in the visual and audio modalities is formulated as two binary classification tasks, and the multitask learning paradigm is introduced to integrate these tasks into a multimodal task. Furthermore, WMMT utilizes a Mixture-of-Experts structure to adaptively select appropriate features and localization heads, achieving excellent flexibility and localization precision in WS-MTFL. A feature enhancement module with a temporal-property-preserving attention mechanism is proposed to identify intra- and inter-modality feature deviation and construct comprehensive video features. To further exploit temporal information under weak supervision, we propose an extensible deviation-perceiving loss, which aims to enlarge the deviation of adjacent segments in forged samples and reduce the deviation in genuine samples. Extensive experiments demonstrate the effectiveness of multitask learning for WS-MTFL, and WMMT achieves results comparable to fully supervised approaches on several evaluation metrics.
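The abstract states the goal of the deviation-perceiving loss (enlarge adjacent-segment deviation for forged videos, shrink it for genuine ones) but this page does not give its formula. A minimal NumPy sketch under the assumption that "deviation" is the L2 distance between adjacent segment embeddings and that forged samples are pushed above a margin; the function name and hinge form are illustrative, not the paper's exact definition:

```python
import numpy as np

def deviation_perceiving_loss(features, is_forged, margin=1.0):
    """Hypothetical sketch of a deviation-perceiving loss.

    features: (T, D) array of per-segment embeddings.
    is_forged: video-level label (True if the video contains forgery).
    The exact formulation in the paper is not reproduced here; this
    sketch only mirrors the stated objective.
    """
    # Mean L2 deviation between adjacent temporal segments.
    dev = np.linalg.norm(features[1:] - features[:-1], axis=1).mean()
    if is_forged:
        # Forged videos: push adjacent-segment deviation above a margin.
        return max(0.0, margin - dev)
    # Genuine videos: directly penalize adjacent-segment deviation.
    return dev
```

Because only video-level labels are available, the loss conditions on the video label rather than on any per-frame annotation, which is what makes it usable in the weakly supervised setting.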
Problem

Research questions and friction points this paper is trying to address.

Detect and localize Deepfake videos using weak supervision
Integrate visual and audio modalities via multitask learning
Enhance temporal feature precision with adaptive mechanisms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multitask learning for multimodal forgery localization
Mixture-of-Experts for adaptive feature selection
Temporal property preserving attention mechanism
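The Mixture-of-Experts idea listed above, adaptively selecting among features and localization heads, can be sketched as a gating network that scores expert heads and mixes their frame-level predictions. All names and shapes below are assumptions for illustration; the paper's actual expert and gate designs are not specified on this page:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_select(video_feat, expert_heads, gate_w):
    """Hypothetical MoE gating sketch.

    video_feat: (T, D) segment features for one video.
    expert_heads: list of E callables, each mapping (T, D) -> (T,)
                  frame-level forgery scores.
    gate_w: (D, E) gating weights (illustrative parameterization).
    Returns the gate-weighted combination of expert predictions.
    """
    # Gate on a pooled video representation; weights sum to 1.
    scores = softmax(video_feat.mean(axis=0) @ gate_w)   # (E,)
    preds = np.stack([h(video_feat) for h in expert_heads])  # (E, T)
    return (scores[:, None] * preds).sum(axis=0)         # (T,)
```

A soft (weighted) combination is shown for simplicity; a top-k hard selection over the same gate scores would give the sparser routing commonly associated with MoE layers.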
Wenbo Xu
Sun Yat-sen University
Wei Lu
School of Computer Science and Engineering, MoE Key Laboratory of Information Technology, Guangdong Province Key Laboratory of Information Security Technology, Sun Yat-sen University, Guangzhou 510006, China
Xiangyang Luo
State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou 450002, China