Physics-Driven Spatiotemporal Modeling for AI-Generated Video Detection

📅 2025-10-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
AI-generated videos (e.g., Sora) exhibit high visual fidelity, necessitating reliable detection methods; however, existing approaches struggle to capture subtle spatiotemporal anomalies that violate physical laws. Method: We propose a physics-guided detection paradigm: for the first time, we introduce the principle of probability current conservation into video forensics, defining a Normalized Spatiotemporal Gradient (NSG) statistic that quantifies the ratio between spatial probability gradient magnitude and temporal density variation. We theoretically derive an upper bound on NSG feature discrepancies, providing physically interpretable detection criteria. Leveraging a pre-trained diffusion model, our NSG estimator jointly approximates spatial gradients and models motion-aware temporal dynamics, with discrimination performed via Maximum Mean Discrepancy (MMD). Results: Experiments demonstrate state-of-the-art performance: our method surpasses prior art by 16.00% in Recall and 10.75% in F1-Score, significantly enhancing both detection accuracy and physical interpretability.

📝 Abstract
AI-generated videos have achieved near-perfect visual realism (e.g., Sora), urgently necessitating reliable detection mechanisms. However, detecting such videos faces significant challenges in modeling high-dimensional spatiotemporal dynamics and identifying subtle anomalies that violate physical laws. In this paper, we propose a physics-driven AI-generated video detection paradigm based on probability flow conservation principles. Specifically, we propose a statistic called the Normalized Spatiotemporal Gradient (NSG), which quantifies the ratio of spatial probability gradients to temporal density changes, explicitly capturing deviations from natural video dynamics. Leveraging pre-trained diffusion models, we develop an NSG estimator through spatial gradient approximation and motion-aware temporal modeling, avoiding complex motion decomposition while preserving physical constraints. Building on this, we propose an NSG-based video detection method (NSG-VD) that computes the Maximum Mean Discrepancy (MMD) between NSG features of test and real videos as a detection metric. Finally, we derive an upper bound on NSG feature distances between real and generated videos, proving that generated videos exhibit amplified discrepancies due to distributional shifts. Extensive experiments confirm that NSG-VD outperforms state-of-the-art baselines by 16.00% in Recall and 10.75% in F1-Score. The source code is available at https://github.com/ZSHsh98/NSG-VD.
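The NSG statistic described in the abstract can be illustrated with a minimal, hypothetical sketch. Everything below is an assumption for illustration: `score_fn` stands in for a pre-trained diffusion model's spatial score estimate, and raw frame differencing is used as a crude proxy for the paper's motion-aware temporal modeling; the authors' exact normalization may differ.

```python
import numpy as np

def nsg_statistic(score_fn, frames, eps=1e-8):
    """Hypothetical sketch of a per-frame Normalized Spatiotemporal Gradient.

    score_fn: maps a frame to an estimate of the spatial score
              grad_x log p(x) (in the paper, obtained from a
              pre-trained diffusion model).
    frames:   array of shape (T, ...) holding consecutive frames.
    Returns one ratio per frame transition: spatial gradient
    magnitude over a frame-difference proxy for temporal change.
    """
    values = []
    for t in range(1, len(frames)):
        spatial = np.linalg.norm(score_fn(frames[t]))         # |grad_x log p|
        temporal = np.linalg.norm(frames[t] - frames[t - 1])  # temporal-change proxy
        values.append(spatial / (temporal + eps))
    return np.array(values)
```

A sequence of T frames yields T - 1 NSG values, which can then be aggregated into the feature representation that the detection step compares across real and test videos.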
Problem

Research questions and friction points this paper is trying to address.

Detecting AI-generated videos with near-perfect realism
Modeling high-dimensional spatiotemporal dynamics in videos
Identifying subtle anomalies violating physical laws
Innovation

Methods, ideas, or system contributions that make the work stand out.

Physics-driven modeling for video detection
Normalized Spatiotemporal Gradient quantifies video anomalies
Maximum Mean Discrepancy measures distributional shifts
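The MMD-based discrimination step can be sketched with a generic Gaussian-kernel MMD estimate between two feature sets. This is not the authors' implementation: the kernel choice and the bandwidth `sigma` are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    # Pairwise Gaussian kernel values between rows of X and rows of Y
    d2 = (np.sum(X**2, axis=1)[:, None]
          + np.sum(Y**2, axis=1)[None, :]
          - 2.0 * X @ Y.T)
    return np.exp(-d2 / (2.0 * sigma**2))

def mmd_squared(X, Y, sigma=1.0):
    # Biased estimate of squared MMD between samples X and Y
    kxx = gaussian_kernel(X, X, sigma).mean()
    kyy = gaussian_kernel(Y, Y, sigma).mean()
    kxy = gaussian_kernel(X, Y, sigma).mean()
    return kxx + kyy - 2.0 * kxy
```

Under this scheme, NSG features of a test video that drift from the reference distribution of real videos produce a larger MMD value, which can be thresholded as a detection score.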
👥 Authors
Shuhai Zhang, South China University of Technology (Computer Vision, Machine Learning)
ZiHao Lian, South China University of Technology
Jiahao Yang, South China University of Technology
Daiyuan Li, South China University of Technology
Guoxuan Pang, University of Science and Technology of China
Feng Liu, University of Melbourne
Bo Han, Hong Kong Baptist University
Shutao Li, Hunan University
Mingkui Tan, South China University of Technology (Machine Learning, Large-scale Optimization)