Captain Cinema: Towards Short Movie Generation

📅 2025-07-24

🤖 AI Summary

This work addresses the challenge of generating high-quality, visually coherent, and narratively consistent short films from fine-grained movie plot text. We propose a keyframe-guided two-stage generation framework: first, top-down keyframe planning models the global narrative structure; second, a bottom-up multimodal diffusion transformer (MM-DiT) synthesizes spatiotemporally continuous video. To enhance temporal stability and cross-shot consistency in multi-scene long videos, we introduce a long-context interleaved training strategy. The model is trained on a curated cinematic dataset, balancing generation fidelity and inference efficiency. Experiments indicate that the approach performs favorably on automated short-movie generation, producing visually coherent and narratively consistent results.

📝 Abstract

We present Captain Cinema, a generation framework for short movie generation. Given a detailed textual description of a movie storyline, our approach first generates a sequence of keyframes that outline the entire narrative, which ensures long-range coherence in both the storyline and visual appearance (e.g., scenes and characters). We refer to this step as top-down keyframe planning. These keyframes then serve as conditioning signals for a video synthesis model, which supports long-context learning, to produce the spatio-temporal dynamics between them. This step is referred to as bottom-up video synthesis. To support stable and efficient generation of multi-scene, long-narrative cinematic works, we introduce an interleaved training strategy for Multimodal Diffusion Transformers (MM-DiT), specifically adapted for long-context video data. Our model is trained on a specially curated cinematic dataset consisting of interleaved data pairs. Our experiments demonstrate that Captain Cinema performs favorably in the automated creation of visually coherent and narratively consistent short movies, with high quality and efficiency. Project page: https://thecinema.ai

Problem

Research questions and friction points this paper is trying to address.

Generating coherent short movies from text descriptions

Ensuring long-range narrative and visual consistency

Efficient multi-scene cinematic synthesis with diffusion transformers

Innovation

Methods, ideas, or system contributions that make the work stand out.

Top-down keyframe planning for narrative coherence

Bottom-up video synthesis with long context

Interleaved MM-DiT training for multi-scene stability
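
The two-stage flow described above (top-down keyframe planning, then bottom-up video synthesis conditioned on those keyframes) might be sketched roughly as follows. This is only an illustration of the control flow, not the authors' implementation: `Keyframe`, `plan_keyframes`, `synthesize_clip`, and `generate_movie` are hypothetical names, the "models" are placeholder stubs that return strings, and conditioning each clip on a consecutive pair of keyframes is an assumption made for the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Keyframe:
    scene_index: int
    caption: str  # text that the keyframe is meant to depict


def plan_keyframes(storyline: list[str]) -> list[Keyframe]:
    """Stage 1 (top-down): one placeholder keyframe per scene description.

    Hypothetical stand-in: a real planner would generate images while
    attending to the whole storyline at once.
    """
    return [Keyframe(i, text) for i, text in enumerate(storyline)]


def synthesize_clip(start: Keyframe, end: Keyframe, frames_per_clip: int) -> list[str]:
    """Stage 2 (bottom-up): placeholder frames between two keyframes.

    Hypothetical stand-in for a keyframe-conditioned video model; it only
    labels each frame with the keyframes that condition it.
    """
    return [
        f"scene{start.scene_index}->scene{end.scene_index} frame{t}"
        for t in range(frames_per_clip)
    ]


def generate_movie(storyline: list[str], frames_per_clip: int = 4) -> list[str]:
    """Plan all keyframes first, then fill in the motion between them."""
    keyframes = plan_keyframes(storyline)
    movie: list[str] = []
    # Assumption: each clip is conditioned on a consecutive keyframe pair.
    for start, end in zip(keyframes, keyframes[1:]):
        movie.extend(synthesize_clip(start, end, frames_per_clip))
    return movie
```

Calling `generate_movie` with three scene descriptions yields two clips, one per adjacent keyframe pair. Planning every keyframe before any clip is synthesized is what lets the global narrative constrain each local segment.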

🔎 Similar Papers

MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies

2024-03-03 · arXiv.org · Citations: 21
