🤖 AI Summary
AI-generated videos frequently exhibit temporal inconsistencies, physically implausible motion, and geometric distortions, yet existing benchmarks lack the pixel-level spatial annotations required for fine-grained artifact localization. To address this gap, we introduce BrokenVideos, the first large-scale benchmark explicitly designed for spatially precise localization of artifacts in AI-generated videos. It comprises 3,254 videos accompanied by human-verified, pixel-level artifact masks, addressing both shortcomings of prior datasets at once: coarse annotation granularity and limited spatial accuracy. Leveraging this benchmark, we train and rigorously evaluate both conventional detection models and multimodal large language models on artifact localization. Experimental results demonstrate significant improvements in localization accuracy, measured via intersection-over-union and precision-recall metrics, providing a quantitative foundation for assessing generative video quality.
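The evaluation metrics named above are standard for segmentation-style tasks. As a minimal sketch of how they apply here, the snippet below computes pixel-level IoU, precision, and recall for one predicted artifact mask against a ground-truth mask. The probability-map input format, the 0.5 binarization threshold, and the per-frame averaging shown in the usage comment are illustrative assumptions, not documented details of the BrokenVideos evaluation protocol.

```python
import numpy as np

def mask_metrics(pred: np.ndarray, gt: np.ndarray, thresh: float = 0.5):
    """Pixel-level IoU, precision, and recall for one frame.

    pred: predicted artifact probability map, shape (H, W), values in [0, 1]
    gt:   ground-truth binary artifact mask, shape (H, W), values {0, 1}
    The 0.5 threshold is an illustrative choice (an assumption, not a
    confirmed detail of the benchmark's protocol).
    """
    pred_bin = pred >= thresh
    gt_bin = gt.astype(bool)

    tp = np.logical_and(pred_bin, gt_bin).sum()    # artifact pixels found
    fp = np.logical_and(pred_bin, ~gt_bin).sum()   # clean pixels flagged
    fn = np.logical_and(~pred_bin, gt_bin).sum()   # artifact pixels missed

    union = tp + fp + fn
    iou = tp / union if union > 0 else 1.0         # empty pred and gt agree
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return iou, precision, recall

# One plausible video-level aggregation (the paper's exact protocol may
# differ): average the per-frame IoU over all annotated frames.
# frame_ious = [mask_metrics(p, g)[0] for p, g in zip(pred_frames, gt_frames)]
# video_iou = float(np.mean(frame_ious))
```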
📝 Abstract
Recent advances in deep generative models have led to significant progress in video generation, yet the fidelity of AI-generated videos remains limited. Synthesized content often exhibits visual artifacts such as temporally inconsistent motion, physically implausible trajectories, unnatural object deformations, and local blurring, all of which undermine realism and user trust. Accurate detection and spatial localization of these artifacts are crucial both for automated quality control and for guiding the development of improved generative models. However, the research community currently lacks a comprehensive benchmark specifically designed for artifact localization in AI-generated videos. Existing datasets either restrict themselves to video- or frame-level detection or lack the fine-grained spatial annotations necessary for evaluating localization methods. To address this gap, we introduce BrokenVideos, a benchmark dataset of 3,254 AI-generated videos with meticulously annotated, pixel-level masks highlighting regions of visual corruption. Each annotation is validated through detailed human inspection to ensure high-quality ground truth. Our experiments show that training state-of-the-art artifact detection models and multimodal large language models (MLLMs) on BrokenVideos significantly improves their ability to localize corrupted regions. Through extensive evaluation, we demonstrate that BrokenVideos establishes a critical foundation for benchmarking and advancing research on artifact localization in generative video models. The dataset is available at: https://broken-video-detection-datetsets.github.io/Broken-Video-Detection-Datasets.github.io/.