🤖 AI Summary
Current video generation models struggle to produce long-duration, multi-scene, narratively coherent videos, and high-quality benchmark datasets supporting cross-shot character ID consistency and precise audio-visual-text alignment remain scarce. To address this, the authors introduce the first film-grade hierarchical dataset designed specifically for long-video generation, featuring a three-tier "script-scene-frame" annotation schema that enables multi-scene narrative modeling, cross-shot character ID tracking, and fine-grained alignment across visual, auditory, and textual modalities. The dataset fills a critical gap in evaluation and training benchmarks for long-video generation, systematically exposing new challenges such as character ID drift, and it provides a reproducible, quantitative evaluation protocol grounded in standardized metrics. The dataset is publicly released and actively maintained, establishing a foundational resource for developing and rigorously assessing next-generation long-video generation models.
📝 Abstract
Recent advancements in video generation models, like Stable Video Diffusion, show promising results, but they primarily focus on short, single-scene videos. These models struggle to generate long videos that involve multiple scenes, coherent narratives, and consistent characters. Furthermore, no publicly available dataset is tailored for the analysis, evaluation, and training of long video generation models. In this paper, we present MovieBench: A Hierarchical Movie-Level Dataset for Long Video Generation, which addresses these challenges through three unique contributions: (1) movie-length videos featuring rich, coherent storylines and multi-scene narratives, (2) consistency of character appearance and audio across scenes, and (3) a hierarchical data structure containing high-level movie information and detailed shot-level descriptions. Experiments demonstrate that MovieBench reveals new insights and challenges, such as maintaining character ID consistency across multiple scenes for various characters. The dataset will be publicly released and continuously maintained, aiming to advance the field of long video generation. Data can be found at: https://weijiawu.github.io/MovieBench/.