🤖 AI Summary
Current evaluation of text-to-video generation lacks systematic benchmarks targeting spatiotemporal artifacts—such as physical implausibility and temporal inconsistency. To address this gap, we introduce GeneVA, the first large-scale, human-annotated benchmark for evaluating video generation artifacts, specifically focusing on spatial and temporal inconsistencies and physical reasoning errors induced by model stochasticity. Videos are generated from natural language prompts, and expert annotators label four canonical artifact categories: motion anomalies, geometric distortions, physical violations, and temporal discontinuities—establishing a fine-grained evaluation framework. GeneVA fills a critical data void in quantitative video generation quality assessment, enabling cross-model benchmarking and diagnostic analysis of generative mechanisms. By providing standardized, reproducible evaluation infrastructure, it advances research toward physically plausible and temporally coherent video synthesis.
📝 Abstract
Recent advances in probabilistic generative models have extended capabilities from static image synthesis to text-driven video generation. However, the inherent randomness of their generation process can lead to unpredictable artifacts, such as impossible physics and temporal inconsistency. Progress in addressing these challenges requires systematic benchmarks, yet existing datasets primarily focus on generative images due to the unique spatio-temporal complexities of videos. To bridge this gap, we introduce GeneVA, a large-scale artifact dataset with rich human annotations that focuses on spatio-temporal artifacts in videos generated from natural text prompts. We hope GeneVA can enable and assist critical applications, such as benchmarking model performance and improving generative video quality.