VideoGen-of-Thought: Step-by-step generating multi-shot video with minimal manual intervention

📅 2024-12-03
📈 Citations: 5
Influential: 0
🤖 AI Summary
Existing video generation models struggle to produce coherent multi-shot narrative videos, suffering from fragmented storylines, cross-shot visual inconsistency, and abrupt transitions. To address these limitations, we propose the first cinematic, narrative-oriented, step-by-step "chain-of-thought" video generation paradigm. Our method tackles three core challenges (narrative fragmentation, identity drift, and style mismatch) via dynamic storyline modeling, identity-aware cross-shot propagation of identity-preserving portrait (IPP) tokens, and boundary-aware adjacent latent transitions. We further introduce a shot-description framework spanning five cinematic domains and a latent-space reset strategy at shot boundaries. Experiments demonstrate over 100% improvement in cross-shot consistency, gains of 20.4% and 17.4% in within-shot face and style consistency respectively, and a roughly 90% reduction in manual editing effort. The approach substantially improves long-range narrative controllability and cinematic expressiveness.
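
As a rough illustration of the five-domain shot description, the sketch below models one shot as a record with one field per domain named in the abstract. The `ShotSpec` class, its field names, and the sample values are assumptions for illustration; the paper does not publish a concrete schema.

```python
from dataclasses import dataclass

@dataclass
class ShotSpec:
    """One shot's cinematic specification across the five domains
    (illustrative field names; not the authors' published schema)."""
    character_dynamics: str      # who is on screen and what they do
    background_continuity: str   # setting details carried across shots
    relationship_evolution: str  # how character relations progress
    camera_movement: str         # framing and motion of the camera
    hdr_lighting: str            # lighting and exposure description

# A hypothetical second shot expanded from a one-sentence user prompt.
shot_2 = ShotSpec(
    character_dynamics="Mia reads the letter, hands trembling",
    background_continuity="same rain-streaked kitchen window as shot 1",
    relationship_evolution="Mia's trust in her brother begins to fracture",
    camera_movement="slow push-in from medium shot to close-up",
    hdr_lighting="cool overcast key light, warm practical lamp fill",
)
```

In a pipeline like the one described here, a list of such records would be generated from the user prompt and self-validated before any video synthesis begins.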

📝 Abstract
Current video generation models excel at short clips but fail to produce cohesive multi-shot narratives due to disjointed visual dynamics and fractured storylines. Existing solutions either rely on extensive manual scripting/editing or prioritize single-shot fidelity over cross-scene continuity, limiting their practicality for movie-like content. We introduce VideoGen-of-Thought (VGoT), a step-by-step framework that automates multi-shot video synthesis from a single sentence by systematically addressing three core challenges: (1) Narrative Fragmentation: Existing methods lack structured storytelling. We propose dynamic storyline modeling, which first converts the user prompt into concise shot descriptions, then elaborates them into detailed, cinematic specifications across five domains (character dynamics, background continuity, relationship evolution, camera movements, HDR lighting), ensuring logical narrative progression with self-validation. (2) Visual Inconsistency: Existing approaches struggle with maintaining visual consistency across shots. Our identity-aware cross-shot propagation generates identity-preserving portrait (IPP) tokens that maintain character fidelity while allowing trait variations (expressions, aging) dictated by the storyline. (3) Transition Artifacts: Abrupt shot changes disrupt immersion. Our adjacent latent transition mechanisms implement boundary-aware reset strategies that process adjacent shots' features at transition points, enabling seamless visual flow while preserving narrative continuity. VGoT generates multi-shot videos that outperform state-of-the-art baselines by 20.4% in within-shot face consistency and 17.4% in style consistency, while achieving over 100% better cross-shot consistency and 10x fewer manual adjustments than alternatives.
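
To make the three-stage flow concrete, here is a minimal runnable sketch of a VGoT-style pipeline. Everything in it is an assumption: the function names, tensor shapes, and the linear cross-fade used for the boundary-aware transition are stand-ins, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def expand_storyline(prompt: str, num_shots: int) -> list[str]:
    """Stage 1, dynamic storyline modeling: one sentence becomes ordered
    per-shot descriptions. A real system would prompt an LLM, elaborate
    each shot across the five cinematic domains, and self-validate; this
    stub only shows the interface shape."""
    return [f"{prompt} [shot {i + 1}/{num_shots}]" for i in range(num_shots)]

def make_ipp_tokens(dim: int = 64, n_tokens: int = 4) -> np.ndarray:
    """Stage 2, identity-preserving portrait (IPP) tokens: a fixed set of
    identity embeddings shared by every shot so the same character
    conditions each generation (random placeholders here)."""
    return rng.normal(size=(n_tokens, dim))

def generate_shot(desc: str, ipp: np.ndarray, frames: int = 16) -> np.ndarray:
    """Stand-in for a text- and identity-conditioned video model; returns
    per-frame latents of shape (frames, dim). A real model would consume
    both `desc` and `ipp` rather than sampling noise."""
    return rng.normal(size=(frames, ipp.shape[1]))

def blend_boundary(prev: np.ndarray, nxt: np.ndarray, k: int = 4) -> np.ndarray:
    """Stage 3, adjacent latent transition: cross-fade the last k latent
    frames of one shot into the first k of the next. A simple
    boundary-aware blend in spirit, not the paper's exact reset operator."""
    w = np.linspace(0.0, 1.0, k)[:, None]  # ramp from prev to nxt
    out = nxt.copy()
    out[:k] = (1.0 - w) * prev[-k:] + w * nxt[:k]
    return out

# End-to-end: one sentence in, smoothed multi-shot latents out.
descriptions = expand_storyline("A violinist returns to her flooded hometown", 4)
ipp = make_ipp_tokens()
latents = [generate_shot(d, ipp) for d in descriptions]
for i in range(1, len(latents)):
    latents[i] = blend_boundary(latents[i - 1], latents[i])
```

The design point the abstract emphasizes is that the same IPP tokens condition every shot, while the transition step only touches latents near shot boundaries, leaving mid-shot content untouched.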
Problem

Research questions and friction points this paper is trying to address.

Automates multi-shot video synthesis from a single-sentence prompt.
Addresses narrative fragmentation, visual inconsistency, and transition artifacts.
Improves cross-shot consistency by over 100% and reduces manual adjustments roughly 10x (see the metric sketch after this list).
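
The consistency numbers above are typically computed as cosine similarities between embeddings of the generated frames. Below is a minimal sketch of a cross-shot consistency score, assuming per-frame face (or style) embeddings have already been extracted with some encoder; the paper's exact evaluation protocol may differ.

```python
import numpy as np

def cross_shot_consistency(shot_embeddings: list[np.ndarray]) -> float:
    """Mean cosine similarity between the average embeddings of
    consecutive shots. shot_embeddings[i] holds per-frame face or style
    embeddings for shot i, with shape (frames, dim). Higher is better."""
    means = [e.mean(axis=0) for e in shot_embeddings]
    means = [m / np.linalg.norm(m) for m in means]
    sims = [float(means[i] @ means[i + 1]) for i in range(len(means) - 1)]
    return sum(sims) / len(sims)

# Toy usage with random embeddings for four 16-frame shots.
rng = np.random.default_rng(1)
shots = [rng.normal(size=(16, 512)) for _ in range(4)]
print(f"cross-shot consistency: {cross_shot_consistency(shots):.3f}")
```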
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic storyline modeling for narrative progression
Identity-aware cross-shot propagation for visual consistency
Adjacent latent transition for seamless visual flow