Towards multi-modal forgery representation learning for AI-generated video detection and localization

📅 2026-05-08
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing AI-generated video detectors mostly model a single modality, or only some of the available modalities, and cannot localize partial manipulations at fine temporal granularity. This work proposes a multi-modal architecture that jointly integrates an LMM-driven semantic branch, a spatio-temporal (ST) visual branch, and a multi-scale partial-spoof (PS) audio branch. The framework simultaneously detects AI-generated videos and localizes, in time, the segments that were partially manipulated. The authors report that the approach outperforms existing state-of-the-art methods in their experiments.
📝 Abstract
Recent advances in generative AI have democratized video creation at scale. AI-generated videos, including partially manipulated clips across visual and audio channels, pose escalating risks of semantic distortion and misuse, which motivates the need for reliable detection tools. Most existing AI-generated video detectors remain limited by single- or partial-modality modeling and by the lack of fine-grained temporal forgery localization. To address these challenges, we introduce a core architecture that jointly integrates an LMM semantic branch with a spatio-temporal (ST) visual branch and a multi-scale partial-spoof (PS) audio branch. This multi-modal approach enables simultaneous detection and fine-grained temporal localization of partially manipulated AI-generated video forgeries. Extensive experiments show that this approach outperforms existing state-of-the-art methods.
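To make the idea of joint detection and temporal localization concrete, here is a minimal sketch. It is not the paper's method: the abstract does not say how the three branches are fused, so the sketch assumes each branch emits one forgery score per frame in [0, 1] and combines them with a simple weighted average. The function names (`fuse_scores`, `localize`, `detect`), the weights, and the 0.5 threshold are all hypothetical.

```python
from typing import List, Sequence, Tuple


def fuse_scores(
    semantic: Sequence[float],
    visual: Sequence[float],
    audio: Sequence[float],
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> List[float]:
    """Weighted average of per-frame forgery scores from three branches.

    Assumption: late fusion by averaging. The paper's actual fusion
    mechanism is not described in the abstract.
    """
    if not (len(semantic) == len(visual) == len(audio)):
        raise ValueError("all branches must score the same number of frames")
    total = sum(weights)
    return [
        (weights[0] * s + weights[1] * v + weights[2] * a) / total
        for s, v, a in zip(semantic, visual, audio)
    ]


def localize(scores: Sequence[float], threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Return half-open [start, end) frame ranges whose score exceeds threshold."""
    segments: List[Tuple[int, int]] = []
    start = None
    for i, score in enumerate(scores):
        if score > threshold and start is None:
            start = i  # a suspicious run begins
        elif score <= threshold and start is not None:
            segments.append((start, i))  # the run ended at the previous frame
            start = None
    if start is not None:
        segments.append((start, len(scores)))  # run extends to the last frame
    return segments


def detect(scores: Sequence[float], threshold: float = 0.5) -> bool:
    """Video-level decision: flag the clip if any frame exceeds threshold."""
    return any(score > threshold for score in scores)


if __name__ == "__main__":
    # Hypothetical per-frame scores for a six-frame clip.
    semantic = [0.1, 0.2, 0.9, 0.8, 0.1, 0.9]
    visual = [0.0, 0.1, 0.8, 0.9, 0.2, 0.7]
    audio = [0.2, 0.0, 0.7, 0.7, 0.0, 0.8]
    fused = fuse_scores(semantic, visual, audio)
    print(detect(fused), localize(fused))
```

With these made-up scores, frames 2 to 3 and frame 5 are flagged, so the clip is reported as partially manipulated rather than wholly real or wholly fake. That distinction is what fine-grained temporal localization adds over a single video-level label.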
Problem

Research questions and friction points this paper is trying to address.

AI-generated video
multi-modal forgery
temporal localization
video detection
partial manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-modal learning
forgery localization
AI-generated video detection
spatio-temporal modeling
partial-spoof audio