GeneVA: A Dataset of Human Annotations for Generative Text to Video Artifacts

📅 2025-09-10

📈 Citations: 0

✨ Influential: 0

career value

204K/year

🤖 AI Summary

Current evaluation of text-to-video generation lacks systematic benchmarks targeting spatiotemporal artifacts—such as physical implausibility and temporal inconsistency. To address this gap, we introduce GeneVA, the first large-scale, human-annotated benchmark for evaluating video generation artifacts, specifically focusing on spatial and temporal inconsistencies and physical reasoning errors induced by model stochasticity. Videos are generated from natural language prompts, and expert annotators label four canonical artifact categories: motion anomalies, geometric distortions, physical violations, and temporal discontinuities—establishing a fine-grained evaluation framework. GeneVA fills a critical data void in quantitative video generation quality assessment, enabling cross-model benchmarking and diagnostic analysis of generative mechanisms. By providing standardized, reproducible evaluation infrastructure, it advances research toward physically plausible and temporally coherent video synthesis.

Technology Category

Application Category

📝 Abstract

Recent advances in probabilistic generative models have extended capabilities from static image synthesis to text-driven video generation. However, the inherent randomness of their generation process can lead to unpredictable artifacts, such as impossible physics and temporal inconsistency. Progress in addressing these challenges requires systematic benchmarks, yet existing datasets primarily focus on generative images due to the unique spatio-temporal complexities of videos. To bridge this gap, we introduce GeneVA, a large-scale artifact dataset with rich human annotations that focuses on spatio-temporal artifacts in videos generated from natural text prompts. We hope GeneVA can enable and assist critical applications, such as benchmarking model performance and improving generative video quality.

Problem

Research questions and friction points this paper is trying to address.

Identifying unpredictable artifacts in text-driven video generation

Addressing spatio-temporal complexities in generative video artifacts

Lacking systematic benchmarks for video generation model evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale annotated dataset for video artifacts

Focuses on spatio-temporal inconsistencies in generated videos

Enables benchmarking and improvement of generative video models

🔎 Similar Papers

No similar papers found.

TikTok

San Jose, California

AI Research Scientist, Computer Vision - Facebook Video Intelligence