🤖 AI Summary
Problem: The rapid improvement in AI-generated video quality threatens the authenticity of visual content, yet the scarcity of high-fidelity, real-world forgery video datasets hinders the development of robust detection methods. Method: We introduce GenWorld, the first benchmark dataset tailored to realistic simulation scenarios. It comprises high-fidelity forged videos synthesized by multiple state-of-the-art generative models (including world models such as Cosmos) from prompts in several modalities (text, image, and video). We further propose SpannDetector, a physically interpretable multi-view consistency detector that introduces a detection paradigm grounded in spatiotemporal physical plausibility, combining multi-view feature modeling with a lightweight aggregation network. Contribution/Results: Extensive experiments show that SpannDetector outperforms existing methods on GenWorld, with substantial gains in detecting world-model-generated videos. These results support the effectiveness and generalizability of physics-guided detection.
📝 Abstract
Advances in video generation technology have undermined the credibility of real-world information and intensified the demand for AI-generated video detectors. Despite some progress, the lack of high-quality real-world datasets hinders the development of trustworthy detectors. In this paper, we propose GenWorld, a large-scale, high-quality, real-world simulation dataset for AI-generated video detection. GenWorld has the following characteristics: (1) Real-world Simulation: GenWorld focuses on videos that replicate real-world scenarios, which are especially impactful because of their realism; (2) High Quality: GenWorld employs multiple state-of-the-art video generation models to provide realistic, high-quality forged videos; (3) Cross-prompt Diversity: GenWorld includes videos produced by diverse generators from various prompt modalities (e.g., text, image, video), making it possible to learn more generalizable forensic features. We analyze existing methods and find that they fail to detect high-quality videos generated by world models (i.e., Cosmos), revealing the potential drawbacks of ignoring real-world clues. To address this, we propose a simple yet effective model, SpannDetector, which leverages multi-view consistency as a strong criterion for real-world AI-generated video detection. Experiments show that our method achieves superior results, highlighting a promising direction for explainable AI-generated video detection based on physical plausibility. We believe that GenWorld will advance the field of AI-generated video detection. Project Page: https://chen-wl20.github.io/GenWorld