CMTA: Leveraging Cross-Modal Temporal Artifacts for Generalizable AI-Generated Video Detection

📅 2026-05-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limited generalizability of existing AI-generated video (AIGV) detection methods, which overlook how unnaturally stable the temporal alignment between visual and textual semantics is in synthetic content. The authors identify this stability of vision-language semantic trajectories as a discriminative cue, termed the cross-modal temporal artifact (CMTA), and propose the CMTA framework to exploit it. CMTA uses BLIP for frame-level captioning and CLIP for extracting the corresponding visual-textual representations, then applies a GRU to model coarse-grained temporal fluctuations in cross-modal alignment and a Transformer encoder to capture fine-grained inter-frame variations. Evaluated on 40 subsets across four large-scale datasets (GenVideo, EvalCrafter, VideoPhy, and VidProM), CMTA is reported to set a new state of the art and to generalize well across video generation models.
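To make the cue concrete, the following minimal sketch computes a per-frame alignment trajectory and a simple fluctuation score. It is only an illustration under stated assumptions: the toy 2-D vectors stand in for CLIP image and caption embeddings, and the function names and the mean-absolute-change statistic are hypothetical choices, not the paper's learned GRU-based modeling.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_trajectory(visual, textual):
    """Per-frame similarity between a frame embedding and its caption embedding."""
    return [cosine(v, t) for v, t in zip(visual, textual)]


def temporal_fluctuation(trajectory):
    """Mean absolute frame-to-frame change of the alignment trajectory.

    Hypothetical statistic chosen for illustration only.
    """
    steps = [abs(b - a) for a, b in zip(trajectory, trajectory[1:])]
    return sum(steps) / len(steps)


# Toy stand-ins for embeddings (assumption: real inputs would come from CLIP).
captions = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
stable_frames = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]    # alignment never changes
varying_frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]   # alignment swings

stable_score = temporal_fluctuation(alignment_trajectory(stable_frames, captions))
varying_score = temporal_fluctuation(alignment_trajectory(varying_frames, captions))
```

In this toy setting the stable sequence yields a lower fluctuation score than the varying one, which mirrors the summary's claim that synthetic videos show unusually stable alignment; a hand-crafted score like this is not a substitute for the trained detector.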

📝 Abstract

The proliferation of advanced AI video synthesis techniques poses an unprecedented challenge to digital video authenticity. Existing AI-generated video (AIGV) detection methods primarily focus on uni-modal or spatiotemporal artifacts, but they overlook the rich cues within the visual-textual cross-modal space, especially the temporal stability of semantic alignment. In this work, we identify a distinctive fingerprint in AIGVs, termed cross-modal temporal artifact (CMTA). Unlike real videos that exhibit natural temporal fluctuations in cross-modal alignment due to semantic variations, AIGVs display unnaturally stable semantic trajectories governed by given input prompts. To bridge this gap, we propose the CMTA framework, a cross-modal detection approach that captures these unique temporal artifacts through joint cross-modal embedding and multi-grained temporal modeling. Specifically, CMTA leverages BLIP to generate frame-level image captions and utilizes CLIP to extract corresponding visual-textual representations. A coarse-grained temporal modeling branch is then designed to characterize temporal fluctuations in cross-modal alignment with a GRU. In parallel, a fine-grained branch is constructed to capture intricate inter-frame variations from integrated visual-textual features with a Transformer encoder. Extensive experiments on 40 subsets across four large-scale datasets, including GenVideo, EvalCrafter, VideoPhy, and VidProM, validate that our approach sets a new state-of-the-art while exhibiting superior cross-generator generalization. Code and models of CMTA will be released at https://github.com/hwang-cs-ime/CMTA
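The two-branch design described in the abstract can be sketched roughly as follows in PyTorch. This is a hedged illustration, not the released implementation: the class name `TwoBranchDetector`, the layer sizes, concatenation as the way visual and textual features are integrated, mean pooling over frames, and the linear classification head are all assumptions. Random tensors stand in for the CLIP embeddings of frames and their BLIP captions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoBranchDetector(nn.Module):
    """Hypothetical sketch: GRU over alignment scores + Transformer over fused features."""

    def __init__(self, embed_dim=512, hidden=64, heads=4, layers=2):
        super().__init__()
        # Coarse branch: GRU over the scalar per-frame alignment trajectory.
        self.gru = nn.GRU(input_size=1, hidden_size=hidden, batch_first=True)
        # Fine branch: project concatenated visual-textual features (assumed fusion).
        self.proj = nn.Linear(2 * embed_dim, hidden)
        enc_layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, dim_feedforward=4 * hidden, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        # Assumed head: one real-vs-generated logit from both branch summaries.
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, visual, textual):
        # visual, textual: (batch, frames, embed_dim)
        sim = F.cosine_similarity(visual, textual, dim=-1).unsqueeze(-1)  # (B, T, 1)
        _, h_n = self.gru(sim)                 # h_n: (1, B, hidden)
        coarse = h_n[-1]                       # (B, hidden)
        fused = self.proj(torch.cat([visual, textual], dim=-1))  # (B, T, hidden)
        fine = self.encoder(fused).mean(dim=1)                   # (B, hidden)
        return self.head(torch.cat([coarse, fine], dim=-1)).squeeze(-1)  # (B,)


# Usage with random stand-ins for CLIP embeddings: 2 videos, 5 frames, 8-dim features.
torch.manual_seed(0)
model = TwoBranchDetector(embed_dim=8, hidden=16, heads=2, layers=1)
logits = model(torch.randn(2, 5, 8), torch.randn(2, 5, 8))
```

The sketch keeps the two signals separate until the final layer, so the coarse branch sees only how alignment evolves over time while the fine branch sees the full per-frame features; how the actual framework combines the branches is not specified in the abstract.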

Problem

Research questions and friction points this paper is trying to address.

AI-generated video detection

cross-modal alignment

temporal artifacts

video authenticity

semantic stability

Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-modal temporal artifact

AI-generated video detection

semantic alignment stability