AI Summary
Existing video generation methods excel at single-shot synthesis but struggle to produce multi-shot videos with narrative coherence and flexible shot composition. This paper introduces the first controllable generation framework for narrative multi-shot video synthesis. Our approach extends a pretrained single-shot diffusion model with two RoPE variants: a Multi-Shot Narrative RoPE that applies explicit phase shifts at shot transitions, and a Spatiotemporal Position-Aware RoPE that enables cross-shot reference information injection. On top of these, we design reference token injection, cross-shot localization signal fusion, and spatiotemporal grounding mechanisms. To address data scarcity, we construct an automated multi-shot annotation pipeline. Experiments demonstrate significant improvements over prior work in shot coherence, text-video alignment, and controllability over shot count and duration. The framework enables high-fidelity, editable multi-shot narrative video generation.
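To make the phase-shift idea concrete, below is a minimal sketch of how a multi-shot narrative RoPE could work, assuming a standard 1D temporal RoPE over frame tokens. The `shot_phase` offset constant and the per-shot position layout are illustrative assumptions, not the paper's implementation.

```python
# Sketch of a multi-shot narrative RoPE: frames within a shot get
# consecutive positions (preserving narrative temporal order), while
# each shot transition adds a large positional jump that acts as an
# explicit phase shift. The `shot_phase` value is a hypothetical
# constant chosen for illustration.
import torch

def multi_shot_rope_angles(shot_lengths, dim, shot_phase=100.0, base=10000.0):
    """Return rotary angles of shape (total_frames, dim // 2)."""
    positions, offset = [], 0.0
    for length in shot_lengths:
        positions.extend(offset + t for t in range(length))
        offset += length + shot_phase  # explicit jump at the shot boundary
    pos = torch.tensor(positions)                                # (T,)
    inv_freq = base ** (-torch.arange(0, dim, 2).float() / dim)  # (dim/2,)
    return torch.outer(pos, inv_freq)                            # (T, dim/2)

def apply_rope(x, angles):
    """Rotate query/key features x of shape (T, dim) by the angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Example: a 3-shot video with 16, 8, and 12 frames of tokens.
angles = multi_shot_rope_angles([16, 8, 12], dim=64)
q = torch.randn(16 + 8 + 12, 64)
q_rot = apply_rope(q, angles)
```

Because the within-shot positions remain consecutive, attention inside a shot is unchanged, while the boundary jump lets the model distinguish shots without discarding their narrative order.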
Abstract
Current video generation techniques excel at single-shot clips but struggle to produce narrative multi-shot videos, which require flexible shot arrangement, coherent narrative, and controllability beyond text prompts. To tackle these challenges, we propose MultiShotMaster, a framework for highly controllable multi-shot video generation. We extend a pretrained single-shot model by integrating two novel variants of RoPE. First, we introduce Multi-Shot Narrative RoPE, which applies an explicit phase shift at each shot transition, enabling flexible shot arrangement while preserving the temporal narrative order. Second, we design Spatiotemporal Position-Aware RoPE to incorporate reference tokens and grounding signals, enabling spatiotemporally grounded reference injection. In addition, to overcome data scarcity, we establish an automated data annotation pipeline that extracts multi-shot videos, captions, cross-shot grounding signals, and reference images. Our framework leverages these intrinsic architectural properties to support multi-shot video generation, featuring text-driven inter-shot consistency, customized subjects with motion control, and background-driven scene customization. Both shot count and duration are flexibly configurable. Extensive experiments demonstrate the superior performance and outstanding controllability of our framework.
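As a rough illustration of spatiotemporally grounded reference injection, the sketch below assigns reference-image tokens the factorized (t, y, x) rotary positions of the video region where the subject should appear, so attention treats them as if located there. The function name, the 3-axis position layout, and the center-frame assignment are hypothetical assumptions for illustration, not taken from the paper.

```python
# Hypothetical sketch: map a reference image's token grid onto grounded
# (t, y, x) positions given a frame span and a spatial bounding box in
# latent coordinates. These positions would then feed a factorized
# 3D RoPE alongside the video tokens.
import torch

def reference_token_positions(t_range, bbox, ref_hw):
    """t_range: (t0, t1) frame span; bbox: (y0, x0, y1, x1); ref_hw: (H, W)."""
    (t0, t1), (y0, x0, y1, x1), (H, W) = t_range, bbox, ref_hw
    t_pos = torch.full((H * W,), (t0 + t1) / 2.0)         # center frame (assumed)
    ys = torch.linspace(y0, y1, H).repeat_interleave(W)   # row coordinates
    xs = torch.linspace(x0, x1, W).repeat(H)              # column coordinates
    return torch.stack([t_pos, ys, xs], dim=-1)           # (H*W, 3)

# Reference subject grounded to frames 4-12 inside a latent-space box.
pos = reference_token_positions((4, 12), (2.0, 3.0, 10.0, 12.0), (8, 8))
```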