🤖 AI Summary
Addressing the challenge of jointly ensuring multi-character interaction, shot continuity, and visual cinematic effects in scene synthesis, this paper proposes a two-stage generative framework. In the first stage, a large language model (LLM) performs structured storyboard planning, explicitly modeling character relationships, action timing, and cinematic grammar. In the second stage, a cinematic-semantic-enhanced text-to-image model generates high-fidelity keyframes, augmented by multi-scale spatiotemporal consistency constraints and a dedicated cinematic rendering module. The authors further introduce CineVerse, a large-scale dataset tailored for film synthesis comprising 12K professional storyboard–image pairs. Experiments demonstrate promising improvements in keyframe coherence, multi-character dynamic interaction modeling, and transition naturalness, strengthening the narrative plausibility and cinematic authenticity of generated scenes.
📝 Abstract
We present CineVerse, a novel framework for the task of cinematic scene composition. Like traditional multi-shot generation, our task emphasizes consistency and continuity across frames; it additionally addresses challenges inherent to filmmaking, such as multiple characters, complex interactions, and visual cinematic effects. To learn to generate such content, we first create the CineVerse dataset, which we use to train our proposed two-stage approach. First, we prompt a large language model (LLM) with task-specific instructions to convert a high-level scene description into a detailed plan covering the overall setting, the characters, and the individual shots. Then, we fine-tune a text-to-image generation model to synthesize high-quality visual keyframes. Experimental results demonstrate that CineVerse yields promising improvements in generating visually coherent and contextually rich movie scenes, paving the way for further exploration in cinematic video synthesis.
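The two-stage pipeline described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `plan_scene` stands in for the instruction-prompted LLM and `generate_keyframes` for the fine-tuned text-to-image model; both model backends are hypothetical callables (mocked here so the sketch runs standalone), and the JSON plan schema (`setting` / `characters` / `shots`) is an assumed simplification of the paper's structured plan.

```python
import json
from dataclasses import dataclass

@dataclass
class Shot:
    index: int
    description: str  # per-shot prompt passed to the text-to-image model

# Assumed instruction template; the actual task-specific prompt is not public.
PLANNING_INSTRUCTIONS = (
    "Given a high-level scene description, output JSON with keys "
    "'setting', 'characters', and 'shots' (a list of shot descriptions)."
)

def plan_scene(scene_description: str, llm) -> dict:
    """Stage 1: prompt an LLM (any text-in/text-out callable) for a structured plan."""
    raw = llm(f"{PLANNING_INSTRUCTIONS}\n\nScene: {scene_description}")
    return json.loads(raw)

def generate_keyframes(plan: dict, t2i) -> list:
    """Stage 2: render one keyframe per planned shot with a text-to-image model."""
    context = f"Setting: {plan['setting']}. Characters: {', '.join(plan['characters'])}."
    shots = [Shot(i, d) for i, d in enumerate(plan["shots"])]
    # Prepending the shared setting/character context to every shot prompt is one
    # simple way to encourage cross-shot consistency.
    return [t2i(f"{context} Shot {s.index + 1}: {s.description}") for s in shots]

# Mock backends so the sketch runs without any model weights.
mock_llm = lambda prompt: json.dumps({
    "setting": "rain-soaked alley at night",
    "characters": ["detective", "informant"],
    "shots": ["wide establishing shot", "close-up on the informant"],
})
mock_t2i = lambda prompt: f"<image for: {prompt}>"

frames = generate_keyframes(plan_scene("noir meeting scene", mock_llm), mock_t2i)
print(len(frames))  # one keyframe per planned shot
```

In practice the `llm` callable would be a chat-completion client and `t2i` a diffusion pipeline; the point of the sketch is only the data flow, where the stage-1 plan supplies shared context that conditions every stage-2 keyframe.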