ATSS: Detecting AI-Generated Videos via Anomalous Temporal Self-Similarity

📅 2026-04-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing methods for detecting AI-generated videos struggle to model the generative logic underlying their global temporal evolution, limiting detection performance. This work reveals, for the first time, that AI-generated videos exhibit anchor-driven unnatural repetitive correlations, manifesting as anomalous temporal self-similarity (ATSS). Building on this insight, we propose a multimodal detection framework that constructs visual, textual, and cross-modal similarity matrices from frame-level descriptions. Dedicated Transformer encoders, coupled with a bidirectional cross-attention fusion mechanism, jointly model intra-modal dynamics and inter-modal temporal interactions. Evaluated on four benchmarks—GenVideo, EvalCrafter, VideoPhy, and VidProM—the proposed method significantly outperforms existing approaches, achieving state-of-the-art results in AP, AUC, and ACC metrics and demonstrating strong generalization across diverse generative models.
📝 Abstract
AI-generated videos (AIGVs) have achieved unprecedented photorealism, posing severe threats to digital forensics. Existing AIGV detectors focus mainly on localized artifacts or short-term temporal inconsistencies, and thus often fail to capture the underlying generative logic governing global temporal evolution, limiting AIGV detection performance. In this paper, we identify a distinctive fingerprint in AIGVs, termed anomalous temporal self-similarity (ATSS). Unlike real videos that exhibit stochastic natural dynamics, AIGVs follow deterministic anchor-driven trajectories (e.g., text or image prompts), inducing unnaturally repetitive correlations across visual and semantic domains. To exploit this insight, we propose ATSS, a multimodal detection framework built on a triple-similarity representation and a cross-attentive fusion mechanism. Specifically, ATSS reconstructs semantic trajectories by leveraging frame-wise descriptions to construct visual, textual, and cross-modal similarity matrices, which jointly quantify the inherent temporal anomalies. These matrices are encoded by dedicated Transformer encoders and integrated via a bidirectional cross-attentive fusion module to effectively model intra- and inter-modal dynamics. Extensive experiments on four large-scale benchmarks, including GenVideo, EvalCrafter, VideoPhy, and VidProM, demonstrate that ATSS significantly outperforms state-of-the-art methods in terms of AP, AUC, and ACC metrics, exhibiting superior generalization across diverse video generation models. Code and models of ATSS will be released at https://github.com/hwang-cs-ime/ATSS.
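The core quantity the abstract describes is a temporal self-similarity matrix over per-frame features, whose unusually high off-diagonal correlations serve as the AIGV fingerprint. The sketch below is a minimal illustration with synthetic frame embeddings and plain cosine similarity; the function names and the off-diagonal-mean statistic are hypothetical, not the paper's actual features or detector.

```python
import numpy as np

def self_similarity(frames: np.ndarray) -> np.ndarray:
    """Cosine self-similarity matrix for T frame embeddings of shape (T, D)."""
    normed = frames / np.linalg.norm(frames, axis=1, keepdims=True)
    return normed @ normed.T  # (T, T); entry [i, j] = cos(frame_i, frame_j)

def off_diagonal_mean(sim: np.ndarray) -> float:
    """Mean similarity between distinct frames: anchor-driven trajectories
    keep frames correlated with a shared prompt, inflating this score."""
    t = sim.shape[0]
    return float(sim[~np.eye(t, dtype=bool)].mean())

rng = np.random.default_rng(0)
# Anchor-driven frames: small perturbations of one shared "prompt" vector.
anchor = rng.normal(size=64)
generated = np.stack([anchor + 0.1 * rng.normal(size=64) for _ in range(8)])
# Stochastic natural dynamics: independent frame embeddings.
real = rng.normal(size=(8, 64))

print(off_diagonal_mean(self_similarity(generated)))  # close to 1
print(off_diagonal_mean(self_similarity(real)))       # close to 0
```

In the actual framework this matrix is computed for visual, textual, and cross-modal features and then fed to Transformer encoders rather than summarized by a single scalar.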
Problem

Research questions and friction points this paper is trying to address.

AI-generated videos
temporal self-similarity
video forensics
temporal anomalies
deepfake detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

anomalous temporal self-similarity
AI-generated video detection
multimodal fusion
cross-attentive mechanism
temporal inconsistency