AI Summary
Existing video generation methods excel at single-shot synthesis but struggle to produce multi-shot videos with narrative coherence and flexible shot composition. This paper introduces the first controllable generation framework for narrative multi-shot video synthesis. Our approach extends a pretrained single-shot diffusion model with two RoPE variants: a Multi-Shot Narrative RoPE that applies explicit phase shifts at shot transitions, and a Spatiotemporal Position-Aware RoPE that enables cross-shot reference information injection. On top of these, we design reference token injection, cross-shot localization signal fusion, and spatiotemporal grounding mechanisms. To address data scarcity, we construct an automated multi-shot annotation pipeline. Experiments demonstrate significant improvements over prior work in shot coherence, text-video alignment, and controllability over shot count and duration. The framework enables high-fidelity, editable multi-shot narrative video generation.
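To make the phase-shift idea concrete, below is a minimal sketch of how a multi-shot narrative RoPE could work, assuming a standard 1D temporal RoPE over frame tokens. The `shot_phase` offset constant and the per-shot position layout are illustrative assumptions, not the paper's implementation.

```python
# Sketch of a multi-shot narrative RoPE: frames within a shot get
# consecutive positions (preserving narrative temporal order), while
# each shot transition adds a large positional jump that acts as an
# explicit phase shift. The `shot_phase` value is a hypothetical
# constant chosen for illustration.
import torch

def multi_shot_rope_angles(shot_lengths, dim, shot_phase=100.0, base=10000.0):
    """Return rotary angles of shape (total_frames, dim // 2)."""
    positions, offset = [], 0.0
    for length in shot_lengths:
        positions.extend(offset + t for t in range(length))
        offset += length + shot_phase  # explicit jump at the shot boundary
    pos = torch.tensor(positions)                                # (T,)
    inv_freq = base ** (-torch.arange(0, dim, 2).float() / dim)  # (dim/2,)
    return torch.outer(pos, inv_freq)                            # (T, dim/2)

def apply_rope(x, angles):
    """Rotate query/key features x of shape (T, dim) by the angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Example: a 3-shot video with 16, 8, and 12 frames of tokens.
angles = multi_shot_rope_angles([16, 8, 12], dim=64)
q = torch.randn(16 + 8 + 12, 64)
q_rot = apply_rope(q, angles)
```

Because the within-shot positions remain consecutive, attention inside a shot is unchanged, while the boundary jump lets the model distinguish shots without discarding their narrative order.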
Abstract
Current video generation techniques excel at single-shot clips but struggle to produce narrative multi-shot videos, which require flexible shot arrangement, coherent narrative, and controllability beyond text prompts. To tackle these challenges, we propose MultiShotMaster, a framework for highly controllable multi-shot video generation. We extend a pretrained single-shot model by integrating two novel variants of RoPE. First, we introduce Multi-Shot Narrative RoPE, which applies an explicit phase shift at each shot transition, enabling flexible shot arrangement while preserving the temporal narrative order. Second, we design Spatiotemporal Position-Aware RoPE to incorporate reference tokens and grounding signals, enabling spatiotemporally grounded reference injection. In addition, to overcome data scarcity, we establish an automated data annotation pipeline that extracts multi-shot videos, captions, cross-shot grounding signals, and reference images. Our framework leverages these intrinsic architectural properties to support multi-shot video generation, featuring text-driven inter-shot consistency, customized subjects with motion control, and background-driven scene customization. Both shot count and duration are flexibly configurable. Extensive experiments demonstrate the superior performance and outstanding controllability of our framework.
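As a rough illustration of spatiotemporally grounded reference injection, the sketch below assigns reference-image tokens the factorized (t, y, x) rotary positions of the video region where the subject should appear, so attention treats them as if located there. The function name, the 3-axis position layout, and the center-frame assignment are hypothetical assumptions for illustration, not taken from the paper.

```python
# Hypothetical sketch: map a reference image's token grid onto grounded
# (t, y, x) positions given a frame span and a spatial bounding box in
# latent coordinates. These positions would then feed a factorized
# 3D RoPE alongside the video tokens.
import torch

def reference_token_positions(t_range, bbox, ref_hw):
    """t_range: (t0, t1) frame span; bbox: (y0, x0, y1, x1); ref_hw: (H, W)."""
    (t0, t1), (y0, x0, y1, x1), (H, W) = t_range, bbox, ref_hw
    t_pos = torch.full((H * W,), (t0 + t1) / 2.0)         # center frame (assumed)
    ys = torch.linspace(y0, y1, H).repeat_interleave(W)   # row coordinates
    xs = torch.linspace(x0, x1, W).repeat(H)              # column coordinates
    return torch.stack([t_pos, ys, xs], dim=-1)           # (H*W, 3)

# Reference subject grounded to frames 4-12 inside a latent-space box.
pos = reference_token_positions((4, 12), (2.0, 3.0, 10.0, 12.0), (8, 8))
```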