🤖 AI Summary
This work addresses a limitation of existing evaluation methods for generated video: they provide only coarse-grained quality scores and cannot localize or categorize specific artifacts. The authors propose an artifact-aware, fine-grained evaluation protocol that defines ten common types of generation artifacts across three dimensions: appearance, motion, and camera. They introduce GenVID, a large-scale annotated dataset of 80,000 generated videos, and develop DVAR (Dense Video Artifact Recognition), a framework for fine-grained identification and classification of artifacts. The study establishes a structured taxonomy of video generation artifacts, reports significantly improved artifact detection accuracy, and offers a tool for evaluating and debugging generative models and for filtering low-quality synthetic content.
📝 Abstract
With the rapid advancement of video generation techniques, evaluating and auditing generated videos has become increasingly crucial. Existing approaches typically offer coarse video quality scores, lacking detailed localization and categorization of specific artifacts. In this work, we introduce a comprehensive evaluation protocol focusing on three key aspects affecting human perception: Appearance, Motion, and Camera. We define these axes through a taxonomy of 10 prevalent artifact categories reflecting common generative failures observed in video generation. To enable robust artifact detection and categorization, we introduce GenVID, a large-scale dataset of 80k videos generated by various state-of-the-art video generation models, each carefully annotated for the defined artifact categories. Leveraging GenVID, we develop DVAR, a Dense Video Artifact Recognition framework for fine-grained identification and classification of generative artifacts. Extensive experiments show that our approach significantly improves artifact detection accuracy and enables effective filtering of low-quality content.