ShotDirector: Directorially Controllable Multi-Shot Video Generation with Cinematographic Transitions

📅 2025-12-11
🤖 AI Summary
Existing multi-shot video generation methods focus on low-level inter-frame visual consistency and neglect cinematic transition design and narrative coherence. To address this, we propose a director-level controllable multi-shot video generation framework featuring: (i) a parameterized camera control scheme (6-DoF pose plus intrinsics) combined with hierarchical prompting that is aware of editing patterns; (ii) ShotWeaver40K, the first large-scale dataset of cinematic editing priors, and its dedicated evaluation protocol; and (iii) a conditional generation architecture integrating shot-aware masking, hierarchical semantic injection, and conditioning driven by editing rules. Experiments demonstrate that our method significantly outperforms state-of-the-art approaches in transition plausibility, fidelity to directorial intent, and joint control of camera motion and content, with consistent improvements across multiple quantitative and qualitative metrics.

📝 Abstract
Shot transitions play a pivotal role in multi-shot video generation, as they determine the overall narrative expression and the directorial design of visual storytelling. However, recent progress has primarily focused on low-level visual consistency across shots, neglecting how transitions are designed and how cinematographic language contributes to coherent narrative expression. This often leads to mere sequential shot changes without intentional film-editing patterns. To address this limitation, we propose ShotDirector, an efficient framework that integrates parameter-level camera control and hierarchical editing-pattern-aware prompting. Specifically, we adopt a camera control module that incorporates 6-DoF poses and intrinsic settings to enable precise camera information injection. In addition, a shot-aware mask mechanism is employed to introduce hierarchical prompts aware of professional editing patterns, allowing fine-grained control over shot content. Through this design, our framework effectively combines parameter-level conditions with high-level semantic guidance, achieving film-like controllable shot transitions. To facilitate training and evaluation, we construct ShotWeaver40K, a dataset that captures the priors of film-like editing patterns, and develop a set of evaluation metrics for controllable multi-shot video generation. Extensive experiments demonstrate the effectiveness of our framework.
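
The shot-aware mask mentioned in the abstract can be pictured with a small sketch. The abstract does not specify the mask layout, so everything below is an assumption for illustration: the function name `build_shot_aware_mask`, the token layout (global prompt tokens first, then each shot's prompt tokens in order), and the rule that a video token may attend only to the global prompt and to its own shot's prompt.

```python
from typing import List


def build_shot_aware_mask(
    shot_lengths: List[int],
    global_len: int,
    shot_prompt_lens: List[int],
) -> List[List[bool]]:
    """Build a boolean cross-attention mask (hypothetical layout).

    Rows are video tokens, ordered shot by shot.
    Columns are text tokens: global prompt first, then per-shot prompts.
    mask[i][j] is True when video token i may attend to text token j.
    """
    if len(shot_lengths) != len(shot_prompt_lens):
        raise ValueError("one prompt length is required per shot")

    total_text = global_len + sum(shot_prompt_lens)
    mask: List[List[bool]] = []
    offset = global_len  # start column of the current shot's prompt

    for n_video, n_text in zip(shot_lengths, shot_prompt_lens):
        row = [False] * total_text
        # Every shot sees the global (video-level) prompt.
        for j in range(global_len):
            row[j] = True
        # Each shot additionally sees only its own shot-level prompt.
        for j in range(offset, offset + n_text):
            row[j] = True
        mask.extend(list(row) for _ in range(n_video))
        offset += n_text

    return mask


# Two shots with 2 and 3 video tokens, 1 global text token,
# and shot prompts of 2 and 1 tokens.
mask = build_shot_aware_mask([2, 3], 1, [2, 1])
print(mask[0])  # first shot:  [True, True, True, False]
print(mask[2])  # second shot: [True, False, False, True]
```

In this sketch the global prompt plays the role of the high-level level of the hierarchy and the per-shot prompts the fine-grained level; the actual hierarchy and injection mechanism used by ShotDirector may differ.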

Problem

Research questions and friction points this paper is trying to address.

Generating multi-shot videos with controllable cinematographic transitions
Integrating camera control and editing patterns for narrative coherence
Addressing the lack of intentional film-editing patterns in video generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Camera control module with 6-DoF poses and intrinsics
Hierarchical prompts using shot-aware mask mechanism
Dataset and metrics for film-like editing patterns
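
To make "6-DoF poses and intrinsics" concrete, here is a minimal sketch of the standard pinhole camera parameterization: three rotation angles and three translations (the six degrees of freedom) plus focal lengths and principal point (the intrinsics). This is textbook camera geometry, not the paper's encoding; the `Camera` class, the `project` function, the world-to-camera convention, and the Euler angle order are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Camera:
    # 6-DoF extrinsics: rotation (radians) and translation, world -> camera.
    yaw: float
    pitch: float
    roll: float
    tx: float
    ty: float
    tz: float
    # Intrinsics: focal lengths and principal point, in pixels.
    fx: float
    fy: float
    cx: float
    cy: float


def _matmul(a: List[List[float]], b: List[List[float]]) -> List[List[float]]:
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rotation(cam: Camera) -> List[List[float]]:
    """R = Rz(roll) @ Ry(yaw) @ Rx(pitch); the angle order is an assumption."""
    cp, sp = math.cos(cam.pitch), math.sin(cam.pitch)
    cyw, syw = math.cos(cam.yaw), math.sin(cam.yaw)
    cr, sr = math.cos(cam.roll), math.sin(cam.roll)
    rx = [[1.0, 0.0, 0.0], [0.0, cp, -sp], [0.0, sp, cp]]
    ry = [[cyw, 0.0, syw], [0.0, 1.0, 0.0], [-syw, 0.0, cyw]]
    rz = [[cr, -sr, 0.0], [sr, cr, 0.0], [0.0, 0.0, 1.0]]
    return _matmul(rz, _matmul(ry, rx))


def project(cam: Camera,
            point: Tuple[float, float, float]) -> Optional[Tuple[float, float]]:
    """Project a world point to pixel coordinates; None if behind the camera."""
    r = rotation(cam)
    t = (cam.tx, cam.ty, cam.tz)
    # Camera-space coordinates: X_cam = R @ X_world + t.
    xc, yc, zc = (sum(r[i][k] * point[k] for k in range(3)) + t[i]
                  for i in range(3))
    if zc <= 0:
        return None
    # Pinhole projection with the intrinsics.
    return (cam.fx * xc / zc + cam.cx, cam.fy * yc / zc + cam.cy)


cam = Camera(yaw=0.0, pitch=0.0, roll=0.0, tx=0.0, ty=0.0, tz=0.0,
             fx=100.0, fy=100.0, cx=50.0, cy=50.0)
print(project(cam, (1.0, 2.0, 4.0)))  # (75.0, 100.0)
```

Changing the pose between two shots (for example, a different yaw and translation) while keeping the scene fixed is one way to think about a parameter-level description of a cut; how ShotDirector injects these parameters into the generator is not described in the text above.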

Authors
Xiaoxue Wu
Fudan University
video generation
Xinyuan Chen
Shanghai Artificial Intelligence Laboratory
Yaohui Wang
Research Scientist, Shanghai AI Laboratory | Inria
Machine Learning, Deep Generative Models, Video Generation
Yu Qiao
Shanghai Artificial Intelligence Laboratory