GeneVA: A Dataset of Human Annotations for Generative Text to Video Artifacts

📅 2025-09-10
📈 Citations: 0 (influential: 0)
🤖 AI Summary
Current evaluation of text-to-video generation lacks systematic benchmarks that target spatio-temporal artifacts such as physical implausibility and temporal inconsistency. To address this gap, we introduce GeneVA, the first large-scale, human-annotated benchmark for video generation artifacts, focusing on spatial and temporal inconsistencies and physical reasoning errors induced by the stochasticity of the generation process. Videos are generated from natural-language prompts, and expert annotators label four canonical artifact categories (motion anomalies, geometric distortions, physical violations, and temporal discontinuities), yielding a fine-grained evaluation framework. GeneVA fills a critical data gap in quantitative assessment of video generation quality, enabling cross-model benchmarking and diagnostic analysis of generative mechanisms. By providing standardized, reproducible evaluation infrastructure, it advances research toward physically plausible and temporally coherent video synthesis.
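
This page does not specify the released annotation format, but the summary above implies a per-video record tying a prompt and source model to labels from the four artifact categories. Below is a minimal sketch of what such a record could look like; every field name, the `ArtifactCategory` enum, and the overall layout are illustrative assumptions, not the authors' actual schema.

```python
# Hypothetical GeneVA-style annotation record. All names and fields are
# illustrative assumptions; the paper's released format may differ.
from dataclasses import dataclass, field
from enum import Enum


class ArtifactCategory(str, Enum):
    MOTION_ANOMALY = "motion_anomaly"                  # e.g., limbs jittering across frames
    GEOMETRIC_DISTORTION = "geometric_distortion"      # e.g., object shapes warping
    PHYSICAL_VIOLATION = "physical_violation"          # e.g., objects passing through walls
    TEMPORAL_DISCONTINUITY = "temporal_discontinuity"  # e.g., abrupt, unmotivated scene jumps


@dataclass
class ArtifactAnnotation:
    category: ArtifactCategory
    start_frame: int      # first frame where the artifact is visible
    end_frame: int        # last frame where the artifact is visible
    note: str = ""        # free-text annotator comment


@dataclass
class VideoRecord:
    video_id: str
    prompt: str           # natural-language prompt the video was generated from
    model: str            # text-to-video model that produced the clip
    annotations: list[ArtifactAnnotation] = field(default_factory=list)
```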

📝 Abstract
Recent advances in probabilistic generative models have extended capabilities from static image synthesis to text-driven video generation. However, the inherent randomness of their generation process can lead to unpredictable artifacts, such as impossible physics and temporal inconsistency. Progress in addressing these challenges requires systematic benchmarks, yet existing datasets primarily focus on generative images due to the unique spatio-temporal complexities of videos. To bridge this gap, we introduce GeneVA, a large-scale artifact dataset with rich human annotations that focuses on spatio-temporal artifacts in videos generated from natural text prompts. We hope GeneVA can enable and assist critical applications, such as benchmarking model performance and improving generative video quality.
Problem

Research questions and friction points this paper is trying to address.

Identifying unpredictable artifacts in text-driven video generation
Addressing spatio-temporal complexities in generative video artifacts
Lacking systematic benchmarks for video generation model evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale annotated dataset for video artifacts
Focuses on spatio-temporal inconsistencies in generated videos
Enables benchmarking and improvement of generative video models (see the sketch below)
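
To make the benchmarking use case above concrete: with records shaped like the hypothetical `VideoRecord` sketched earlier, a cross-model comparison can reduce to the fraction of each model's clips that exhibit each artifact category. The function below is an illustrative sketch under that assumed schema, not an official GeneVA evaluation script.

```python
# Hypothetical cross-model artifact-rate computation over annotated records;
# assumes the VideoRecord / ArtifactAnnotation sketch above.
from collections import Counter, defaultdict


def artifact_rates(records):
    """Return, per model, the fraction of videos showing each artifact category."""
    totals = Counter()            # model -> number of annotated videos
    hits = defaultdict(Counter)   # model -> category -> videos with >= 1 such artifact

    for rec in records:
        totals[rec.model] += 1
        # Count each category at most once per video, regardless of how many
        # individual annotations of that category the video carries.
        for category in {a.category for a in rec.annotations}:
            hits[rec.model][category] += 1

    return {
        model: {cat.value: n / totals[model] for cat, n in cats.items()}
        for model, cats in hits.items()
    }
```

Counting at the video level rather than the annotation level keeps the rates comparable across models whose clips differ in length or annotation density.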
Jenna Kang
New York University

Maria Silva
New York University

Patsorn Sangkloy
New York University
Computer Vision · Deep Learning

Kenneth Chen
New York University
Computer Graphics · Vision Science · Virtual Reality · Computational Displays · Applied Perception

Niall Williams
New York University

Qi Sun
New York University