DreamShot: Personalized Storyboard Synthesis with Video Diffusion Prior

📅 2026-04-18

🤖 AI Summary
Existing text-to-image storyboard generation methods struggle to maintain visual coherence, character consistency, and narrative fluency across multiple shots. This work proposes the first controllable multi-shot storyboard generation framework leveraging video diffusion priors, supporting both text- and reference-image-driven shot generation and story continuation. By introducing a multi-reference character conditioning module and a character attention consistency loss, the method effectively exploits the spatiotemporal priors inherent in video diffusion models, significantly enhancing character identity alignment and scene coherence. Experimental results demonstrate that the proposed approach outperforms state-of-the-art storyboard generation models in terms of narrative fidelity, character consistency, and generation efficiency.

📝 Abstract
Storyboard synthesis plays a crucial role in visual storytelling, aiming to generate coherent shot sequences that visually narrate cinematic events with consistent characters, scenes, and transitions. However, existing approaches are mostly adapted from text-to-image diffusion models, which struggle to maintain long-range temporal coherence, consistent character identities, and narrative flow across multiple shots. In this paper, we introduce DreamShot, a storyboard framework built on a video generative model that fully exploits powerful video diffusion priors for controllable multi-shot synthesis. DreamShot supports both Text-to-Shot and Reference-to-Shot generation, as well as story continuation conditioned on previous frames, enabling flexible and context-aware storyboard generation. By leveraging the spatiotemporal consistency inherent in video generative models, DreamShot produces visually and semantically coherent sequences with improved narrative fidelity and character continuity. Furthermore, DreamShot incorporates a multi-reference role conditioning module that accepts multiple character reference images and enforces identity alignment via a Role-Attention Consistency Loss, which explicitly constrains attention between reference and generated roles. Extensive experiments demonstrate that DreamShot achieves superior scene coherence, role consistency, and generation efficiency compared to state-of-the-art text-to-image storyboard models, establishing a new direction toward controllable, video-model-driven visual storytelling.
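The abstract does not give the form of the Role-Attention Consistency Loss, so the following is only an illustrative sketch of what "constraining attention between reference and generated roles" could look like, not the authors' formulation. It assumes a row-normalized attention matrix from generated tokens to reference tokens, a per-role mask over generated tokens, and a role label for each reference token; the function name, argument names, and the "1 minus attention mass" penalty are all hypothetical.

```python
import numpy as np

def role_attention_consistency_loss(attn, role_masks, ref_token_roles):
    """Illustrative loss (an assumption, not the paper's definition).

    attn:            (Q, R) attention from Q generated tokens to R reference
                     tokens; each row is assumed to sum to 1.
    role_masks:      (K, Q) boolean; role_masks[k, q] is True when generated
                     token q lies inside the region of role k.
    ref_token_roles: (R,) integer role index of each reference token.
    """
    attn = np.asarray(attn, dtype=float)
    role_masks = np.asarray(role_masks, dtype=bool)
    ref_token_roles = np.asarray(ref_token_roles)

    per_role = []
    for k in range(role_masks.shape[0]):
        queries = role_masks[k]
        if not queries.any():
            continue  # role k does not appear in this shot
        # Attention mass that role-k generated tokens place on role-k references.
        mass = attn[queries][:, ref_token_roles == k].sum(axis=1)
        per_role.append(1.0 - mass.mean())
    # Average over the roles that are present; 0.0 if no role is present.
    return float(np.mean(per_role)) if per_role else 0.0


# Two roles, one generated token and one reference token each.
attn = np.array([[1.0, 0.0],
                 [0.0, 1.0]])
masks = np.array([[True, False],
                  [False, True]])
aligned = role_attention_consistency_loss(attn, masks, np.array([0, 1]))
swapped = role_attention_consistency_loss(attn, masks, np.array([1, 0]))
```

Under these assumptions the loss is smallest when each generated role attends only to its own reference images and largest when it attends only to another role's references, which is the identity-alignment behavior the abstract describes.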
Problem

Research questions and friction points this paper is trying to address.

storyboard synthesis
temporal coherence
character consistency
narrative flow
visual storytelling
Innovation

Methods, ideas, or system contributions that make the work stand out.

video diffusion prior
storyboard synthesis
temporal coherence
role consistency
multi-reference conditioning
Junjia Huang
Sun Yat-sen University; Peng Cheng Laboratory
Binbin Yang
ByteDance Intelligent Creation
Pengxiang Yan
ByteDance Intelligent Creation
Jiyang Liu
ByteDance Intelligent Creation
Bin Xia
ByteDance Intelligent Creation
Zhao Wang
ByteDance Intelligent Creation
Yitong Wang
ByteDance Inc.
computer vision
Liang Lin
Fellow of IEEE/IAPR, Professor of Computer Science, Sun Yat-sen University
Embodied AI, Causal Inference and Learning, Multimodal Data Analysis
Guanbin Li
Sun Yat-sen University; Peng Cheng Laboratory; Guangdong Key Laboratory of Big Data Analysis and Processing