STAGE: Storyboard-Anchored Generation for Cinematic Multi-shot Narrative

📅 2025-12-13
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses two weaknesses of multi-shot, film-grade narrative video generation: poor consistency across shots and a lack of cinematic language. The method has three parts: (1) a structural storyboard, composed of a start-end frame pair for each shot, anchors generation and gives strong narrative control; (2) a multi-shot memory pack maintains long-range entity consistency across shots; (3) a dual-encoding strategy provides intra-shot coherence, and a two-stage training scheme teaches the model cinematic inter-shot transitions. The work also contributes ConStoryBoard, a large-scale dataset of movie clips with fine-grained annotations for story progression, cinematic attributes, and human preferences. In experiments, the framework outperforms prior methods in structured narrative control and cross-shot coherence. It is presented as the first approach to enable high-fidelity, highly controllable multi-shot cinematic narrative video generation.

📝 Abstract
While recent advances in generative models have achieved remarkable visual fidelity in video synthesis, creating coherent multi-shot narratives remains a significant challenge. Keyframe-based approaches have emerged as a promising alternative to computationally intensive end-to-end methods, offering fine-grained control and greater efficiency. However, these methods often fail to maintain cross-shot consistency and to capture cinematic language. In this paper, we introduce STAGE, a SToryboard-Anchored GEneration workflow that reformulates the keyframe-based multi-shot video generation task. Instead of using sparse keyframes, we propose STEP2 to predict a structural storyboard composed of start-end frame pairs for each shot. We introduce a multi-shot memory pack to ensure long-range entity consistency, a dual-encoding strategy for intra-shot coherence, and a two-stage training scheme to learn cinematic inter-shot transitions. We also contribute the large-scale ConStoryBoard dataset, which includes high-quality movie clips with fine-grained annotations for story progression, cinematic attributes, and human preferences. Extensive experiments demonstrate that STAGE achieves superior performance in structured narrative control and cross-shot coherence.

Problem

Research questions and friction points this paper is trying to address.

Creating coherent multi-shot narratives with visual fidelity
Maintaining cross-shot consistency in keyframe-based video generation
Capturing cinematic language and inter-shot transitions effectively

Innovation

Methods, ideas, or system contributions that make the work stand out.

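Predicts a structural storyboard made of a start-end frame pair for each shot
Ensures long-range entity consistency with a multi-shot memory pack
Maintains intra-shot coherence with a dual-encoding strategy and learns cinematic inter-shot transitions via a two-stage training scheme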
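
To make the storyboard idea above concrete, the following is a minimal sketch of how a storyboard of start-end frame pairs might be represented as a data structure. It is not the paper's implementation: the names `ShotAnchor`, `Storyboard`, `transitions`, and `memory_for` are hypothetical, frames are stood in for by file-name strings, and `memory_for` is only a toy stand-in for the multi-shot memory pack, whose actual design the abstract does not describe.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ShotAnchor:
    """One shot, anchored by a start frame and an end frame (hypothetical type)."""
    shot_id: int
    start_frame: str       # identifier of the shot's first frame (assumed: a file name)
    end_frame: str         # identifier of the shot's last frame
    description: str = ""  # optional text describing the shot


@dataclass
class Storyboard:
    """Ordered list of shots, each with a start-end frame pair (hypothetical type)."""
    shots: List[ShotAnchor] = field(default_factory=list)

    def add_shot(self, start_frame: str, end_frame: str, description: str = "") -> ShotAnchor:
        # Shot ids follow insertion order, starting at 0.
        shot = ShotAnchor(len(self.shots), start_frame, end_frame, description)
        self.shots.append(shot)
        return shot

    def transitions(self) -> List[Tuple[str, str]]:
        # Each cut is the pair (end frame of shot i, start frame of shot i + 1).
        return [(a.end_frame, b.start_frame) for a, b in zip(self.shots, self.shots[1:])]

    def memory_for(self, shot_id: int, max_items: int = 4) -> List[str]:
        # Toy stand-in for a memory pack: the most recent frames from earlier
        # shots, which a generator could condition on for entity consistency.
        earlier = [f for s in self.shots[:shot_id] for f in (s.start_frame, s.end_frame)]
        return earlier[-max_items:]


if __name__ == "__main__":
    board = Storyboard()
    board.add_shot("shot0_start.png", "shot0_end.png", "wide establishing shot")
    board.add_shot("shot1_start.png", "shot1_end.png", "close-up on the lead")
    print(board.transitions())
    print(board.memory_for(1))
```

In this sketch, the end frame of one shot and the start frame of the next together describe a cut, which is where a model would have to learn an inter-shot transition, while the frames within a single pair bound the content that must stay coherent inside a shot.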