🤖 AI Summary
This work addresses two weaknesses of multi-shot, film-grade narrative video generation: weak inter-shot consistency and the absence of cinematic language. Methodologically: (1) a structured storyboard, represented as a start-end frame pair for each shot, serves as a spatiotemporal anchor for narrative control; (2) a multi-shot memory pack models long-range entity consistency across shots; (3) a dual-encoding strategy ensures intra-shot coherence, and a two-stage training scheme learns cinematic inter-shot transitions. The authors also contribute ConStoryBoard, a large-scale dataset of movie clips with fine-grained annotations for story progression, cinematic attributes, and human preferences. In experiments, the proposed STAGE framework outperforms prior methods in both structured narrative control and cross-shot coherence. To our knowledge, it is the first approach to enable high-fidelity, highly controllable multi-shot cinematic narrative video generation.
📝 Abstract
While recent advances in generative models have achieved remarkable visual fidelity in video synthesis, creating coherent multi-shot narratives remains a significant challenge. Keyframe-based approaches have emerged as a promising alternative to computationally intensive end-to-end methods, offering fine-grained control and greater efficiency. However, these methods often fail to maintain cross-shot consistency or to capture cinematic language. In this paper, we introduce STAGE, a SToryboard-Anchored GEneration workflow that reformulates the keyframe-based multi-shot video generation task. Instead of using sparse keyframes, we propose STEP2 to predict a structural storyboard composed of a start-end frame pair for each shot. We introduce a multi-shot memory pack to ensure long-range entity consistency, a dual-encoding strategy for intra-shot coherence, and a two-stage training scheme to learn cinematic inter-shot transitions. We also contribute the large-scale ConStoryBoard dataset, which comprises high-quality movie clips with fine-grained annotations for story progression, cinematic attributes, and human preferences. Extensive experiments demonstrate that STAGE achieves superior performance in structured narrative control and cross-shot coherence.
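To make the storyboard representation concrete, the following is a minimal sketch of a storyboard stored as one start-end frame pair per shot. It is an illustration only, not the authors' implementation: the names `ShotAnchor`, `build_storyboard`, and `cut_points`, and the use of string frame identifiers, are assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ShotAnchor:
    """One storyboard entry: the start and end frame of a single shot."""
    shot_id: int
    start_frame: str  # hypothetical frame identifier (e.g. a file name)
    end_frame: str    # hypothetical frame identifier


def build_storyboard(frame_pairs: List[Tuple[str, str]]) -> List[ShotAnchor]:
    """Wrap (start, end) pairs into an ordered storyboard, one anchor per shot."""
    return [ShotAnchor(i, start, end) for i, (start, end) in enumerate(frame_pairs)]


def cut_points(storyboard: List[ShotAnchor]) -> List[Tuple[str, str]]:
    """Pair the end frame of each shot with the start frame of the next shot."""
    return [(a.end_frame, b.start_frame) for a, b in zip(storyboard, storyboard[1:])]


storyboard = build_storyboard([("s0", "e0"), ("s1", "e1"), ("s2", "e2")])
cuts = cut_points(storyboard)
```

In this sketch, each `ShotAnchor` bounds the content inside one shot, while `cut_points` exposes the frame pairs on either side of a cut, which is where inter-shot transitions occur.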