🤖 AI Summary
Existing storyboard generation methods, mostly adapted from text-to-image diffusion models, struggle to maintain visual coherence, character consistency, and narrative fluency across multiple shots. This work proposes DreamShot, a controllable multi-shot storyboard generation framework built on video diffusion priors that supports text-driven and reference-image-driven shot generation as well as story continuation from previous frames. A multi-reference character conditioning module accepts several character reference images and enforces identity alignment through a character attention consistency loss (the Role-Attention Consistency Loss), while the spatiotemporal priors of the video diffusion model improve scene coherence. Experiments show that the approach outperforms state-of-the-art text-to-image storyboard models in scene coherence, character consistency, and generation efficiency.
📝 Abstract
Storyboard synthesis plays a crucial role in visual storytelling, aiming to generate coherent shot sequences that visually narrate cinematic events with consistent characters, scenes, and transitions. However, existing approaches are mostly adapted from text-to-image diffusion models, which struggle to maintain long-range temporal coherence, consistent character identities, and narrative flow across multiple shots. In this paper, we introduce DreamShot, a storyboard framework built on a video generative model that exploits powerful video diffusion priors for controllable multi-shot synthesis. DreamShot supports both Text-to-Shot and Reference-to-Shot generation, as well as story continuation conditioned on previous frames, enabling flexible and context-aware storyboard generation. By leveraging the spatial-temporal consistency inherent in video generative models, DreamShot produces visually and semantically coherent sequences with improved narrative fidelity and character continuity. Furthermore, DreamShot incorporates a multi-reference role conditioning module that accepts multiple character reference images and enforces identity alignment via a Role-Attention Consistency Loss, which explicitly constrains attention between reference and generated roles. Extensive experiments demonstrate that DreamShot achieves superior scene coherence, role consistency, and generation efficiency compared to state-of-the-art text-to-image storyboard models, establishing a new direction toward controllable, video-model-driven visual storytelling.