A Multimodal Deviation Perceiving Framework for Weakly-Supervised Temporal Forgery Localization

📅 2025-07-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitation of existing video forgery detection methods that rely on costly frame-level annotations and thus lack scalability to large-scale datasets. We propose a temporal forgery localization framework requiring only video-level weak supervision. Our method introduces a multimodal discrepancy-aware architecture: (1) modeling audio-visual inconsistency in a probabilistic embedding space via cross-modal attention; (2) enforcing explicit discrepancy amplification within forged segments through a scalable discrepancy-aware loss; and (3) integrating audio and visual features with temporal coherence preservation. Evaluated on multiple benchmarks, our approach achieves localization accuracy—measured by AUC and temporal IoU—comparable to fully supervised counterparts, while drastically reducing annotation overhead. The framework demonstrates strong scalability and offers a practical, weakly supervised solution for large-scale video forgery detection.

📝 Abstract
Current research on Deepfake forensics often treats detection as a classification task or a temporal forgery localization problem, both of which are usually restrictive, time-consuming, and hard to scale to large datasets. To resolve these issues, we present a multimodal deviation perceiving framework for weakly-supervised temporal forgery localization (MDP), which aims to identify partially forged temporal segments using only video-level annotations. MDP introduces a novel multimodal interaction mechanism (MI) and an extensible deviation perceiving loss to perceive multimodal deviation, yielding refined localization of the start and end timestamps of forged segments. Specifically, MI employs a temporal-property-preserving cross-modal attention to measure the relevance between the visual and audio modalities in a probabilistic embedding space. It identifies inter-modality deviation and constructs comprehensive video features for temporal forgery localization. To further exploit temporal deviation under weak supervision, an extensible deviation perceiving loss is proposed, which enlarges the deviation between adjacent segments of forged samples while reducing that of genuine samples. Extensive experiments demonstrate the effectiveness of the proposed framework, which achieves results comparable to fully-supervised approaches on several evaluation metrics.
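The deviation perceiving loss described above (enlarging adjacent-segment deviation for forged samples, suppressing it for genuine ones) can be sketched as follows. This is a minimal illustrative reconstruction from the abstract, not the paper's exact formulation; the `margin` hyperparameter and function names are assumptions.

```python
import numpy as np

def deviation_perceiving_loss(dev_scores, is_forged, margin=1.0):
    """Illustrative sketch of a deviation perceiving loss.

    dev_scores: (T,) per-segment audio-visual deviation scores.
    is_forged:  video-level label (1 = partially forged, 0 = genuine).
    margin:     hypothetical margin hyperparameter (an assumption).

    For forged videos, the loss pushes adjacent-segment deviation
    differences above the margin (sharpening forged-segment boundaries);
    for genuine videos it pulls them toward zero (temporal smoothness).
    """
    adj_dev = np.abs(np.diff(dev_scores))            # |d_{t+1} - d_t|
    if is_forged:
        # hinge: penalize forged videos whose adjacent deviations are flat
        return float(np.maximum(margin - adj_dev, 0.0).mean())
    # genuine videos should have temporally consistent deviation scores
    return float(adj_dev.mean())
```

Under this sketch, a genuine video with flat deviation scores incurs zero loss, while a forged video with flat scores is penalized up to the margin, which matches the stated goal of amplifying deviation only inside forged samples.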
Problem

Research questions and friction points this paper is trying to address.

Detect forged video segments using video-level annotations
Localize start and end timestamps of forged segments
Measure visual-audio deviation for weakly-supervised learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal interaction mechanism for cross-modal attention
Extensible deviation perceiving loss for weak supervision
Temporal-property-preserving cross-modal attention in a probabilistic embedding space
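The cross-modal attention listed above measures visual-audio relevance per temporal segment. A minimal sketch of that idea, assuming standard scaled dot-product attention with visual queries over audio keys/values (the shapes and names are illustrative, not the paper's exact design):

```python
import numpy as np

def cross_modal_attention(visual, audio):
    """Illustrative cross-modal attention between temporally aligned
    visual and audio segment features.

    visual: (T, D) visual segment embeddings, used as queries.
    audio:  (T, D) audio segment embeddings, used as keys and values.
    Returns fused features (T, D) and the attention map (T, T); low
    relevance on the diagonal can hint at audio-visual deviation.
    """
    d = visual.shape[-1]
    scores = visual @ audio.T / np.sqrt(d)           # (T, T) relevance
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)         # softmax over audio
    fused = attn @ audio                             # audio-informed visual
    return fused, attn
```

Keeping queries in temporal order (one query per segment) is one simple way to preserve the temporal property the paper emphasizes, since each fused feature stays indexed to its original segment.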
👥 Authors
Wenbo Xu — Sun Yat-sen University (Multimodal; Multimedia)
Junyan Wu — Ph.D. student, School of Computer Science and Engineering, Sun Yat-sen University (multimedia forensics and security)
Wei Lu — School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
Xiangyang Luo — State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou, China
Qian Wang — School of Cyber Science and Engineering, Wuhan University, Wuhan, China