🤖 AI Summary
Current video generation models struggle to produce coherent multi-shot narrative videos, facing three key bottlenecks: narrative fragmentation, visual inconsistency, and jarring shot transitions. To address this, we propose the first end-to-end, single-sentence-prompt-driven framework for multi-shot video generation—requiring no manual scripting or post-editing. Our method introduces three core innovations: (1) dynamic storyline modeling, (2) identity-aware cross-shot propagation, and (3) adjacent latent-space transition, integrated with five-dimensional cinematic specification (character motion, background continuity, relational evolution, camera movement, and high-dynamic-range lighting), identity-preserving portrait (IPP) token generation, and boundary-aware latent-space reset. Experiments demonstrate 20.4% and 17.4% improvements in intra-shot face and style consistency, respectively; an over-100% gain in cross-shot consistency over SOTA; and a 90% reduction in manual adjustments. To our knowledge, this is the first work to jointly ensure character consistency, narrative coherence, and visual fluency in multi-shot video generation.
📝 Abstract
Current video generation models excel at short clips but fail to produce cohesive multi-shot narratives due to disjointed visual dynamics and fractured storylines. Existing solutions either rely on extensive manual scripting/editing or prioritize single-shot fidelity over cross-scene continuity, limiting their practicality for movie-like content. We introduce VideoGen-of-Thought (VGoT), a step-by-step framework that automates multi-shot video synthesis from a single sentence by systematically addressing three core challenges: (1) Narrative Fragmentation: Existing methods lack structured storytelling. We propose dynamic storyline modeling, which first converts the user prompt into concise shot descriptions, then elaborates them into detailed, cinematic specifications across five domains (character dynamics, background continuity, relationship evolution, camera movements, HDR lighting), ensuring logical narrative progression with self-validation. (2) Visual Inconsistency: Existing approaches struggle with maintaining visual consistency across shots. Our identity-aware cross-shot propagation generates identity-preserving portrait (IPP) tokens that maintain character fidelity while allowing trait variations (expressions, aging) dictated by the storyline. (3) Transition Artifacts: Abrupt shot changes disrupt immersion. Our adjacent latent transition mechanisms implement boundary-aware reset strategies that process adjacent shots' features at transition points, enabling seamless visual flow while preserving narrative continuity. VGoT generates multi-shot videos that outperform state-of-the-art baselines by 20.4% in within-shot face consistency and 17.4% in style consistency, while achieving over 100% better cross-shot consistency and 10x fewer manual adjustments than alternatives.
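The three-stage pipeline in the abstract (storyline expansion, identity-preserving tokens shared across shots, boundary-aware transitions between adjacent latents) can be sketched as follows. This is a minimal illustrative mock, not the paper's implementation: the function names, the token embedding, and the blending rule are all assumptions standing in for the actual learned components.

```python
# Hedged sketch of a VGoT-style multi-shot pipeline. Every function body is a
# toy stand-in for a learned model (LLM planner, identity encoder, video diffuser).
from dataclasses import dataclass
import random

# The five cinematic domains named in the abstract.
DOMAINS = ["character dynamics", "background continuity",
           "relationship evolution", "camera movement", "HDR lighting"]

@dataclass
class Shot:
    description: str
    spec: dict            # five-dimensional cinematic specification
    ipp_token: list       # identity-preserving portrait token (mock embedding)

def plan_storyline(prompt: str, n_shots: int) -> list:
    """Stage 1 (mock): expand a single-sentence prompt into shot descriptions."""
    return [f"{prompt}, shot {i + 1}" for i in range(n_shots)]

def elaborate(desc: str) -> dict:
    """Elaborate one shot description into the five cinematic domains (mock)."""
    return {d: f"{d} for: {desc}" for d in DOMAINS}

def ipp_token(character: str, dim: int = 8) -> list:
    """Mock IPP token: deterministic per character, so the same character
    maps to the same embedding in every shot (cross-shot identity)."""
    rng = random.Random(character)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

def boundary_reset(prev_latent: list, next_latent: list, alpha: float = 0.5) -> list:
    """Mock boundary-aware transition: blend adjacent shots' boundary latents
    so the cut is smoothed rather than abrupt."""
    return [(1 - alpha) * p + alpha * n for p, n in zip(prev_latent, next_latent)]

def generate(prompt: str, character: str, n_shots: int = 3) -> list:
    token = ipp_token(character)  # one shared token propagated to all shots
    return [Shot(d, elaborate(d), token) for d in plan_storyline(prompt, n_shots)]
```

In this toy form, `generate` ties the stages together: every `Shot` carries the full five-domain spec and the same identity token, and `boundary_reset` would be applied at each shot boundary during decoding.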