The Script is All You Need: An Agentic Framework for Long-Horizon Dialogue-to-Cinematic Video Generation

📅 2026-01-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video generation models struggle to produce long-duration, narratively coherent cinematic content from high-level dialogue due to a semantic gap between creative ideation and visual execution. This work proposes the first script-based dual-agent framework: ScripterAgent transforms raw dialogue into structured screenplays, while DirectorAgent orchestrates state-of-the-art video models to enable cross-scene sequential generation, ensuring narrative consistency. The contributions include ScriptBench—a multimodal screenplay benchmark—along with a Visual-Script Alignment metric and a CriticAgent-based automated evaluation system. Experiments demonstrate that the proposed framework significantly improves script fidelity and temporal coherence of generated videos, with consistent gains validated across multiple mainstream models, thereby establishing a new paradigm for automated film production.

📝 Abstract
Recent advances in video generation have produced models capable of synthesizing stunning visual content from simple text prompts. However, these models struggle to generate long-form, coherent narratives from high-level concepts like dialogue, revealing a "semantic gap" between a creative idea and its cinematic execution. To bridge this gap, we introduce a novel, end-to-end agentic framework for dialogue-to-cinematic-video generation. Central to our framework is ScripterAgent, a model trained to translate coarse dialogue into a fine-grained, executable cinematic script. To enable this, we construct ScriptBench, a new large-scale benchmark with rich multimodal context, annotated via an expert-guided pipeline. The generated script then guides DirectorAgent, which orchestrates state-of-the-art video models using a cross-scene continuous generation strategy to ensure long-horizon coherence. Our comprehensive evaluation, featuring an AI-powered CriticAgent and a new Visual-Script Alignment (VSA) metric, shows our framework significantly improves script faithfulness and temporal fidelity across all tested video models. Furthermore, our analysis uncovers a crucial trade-off in current SOTA models between visual spectacle and strict script adherence, providing valuable insights for the future of automated filmmaking.
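The dual-agent pipeline the abstract describes (dialogue → ScripterAgent → structured script → DirectorAgent → per-scene video generation, with each scene conditioned on the previous one for cross-scene continuity) can be sketched as follows. All class names, function signatures, and the toy video model here are hypothetical illustrations of the workflow, not the authors' actual interfaces.

```python
from dataclasses import dataclass

# Hypothetical structures illustrating the dual-agent workflow;
# the paper's real interfaces are not described at this level of detail.

@dataclass
class Scene:
    index: int
    shot_description: str  # fine-grained, executable instruction
    dialogue: str

@dataclass
class Script:
    scenes: list

def scripter_agent(raw_dialogue):
    """Translate coarse dialogue turns into a structured screenplay (stub)."""
    scenes = [
        Scene(index=i, shot_description=f"Shot for turn {i}", dialogue=turn)
        for i, turn in enumerate(raw_dialogue)
    ]
    return Script(scenes=scenes)

def director_agent(script, video_model):
    """Cross-scene continuous generation: condition each scene on the
    previous clip's ending state to preserve long-horizon coherence."""
    clips, prev_state = [], None
    for scene in script.scenes:
        clip, prev_state = video_model(scene, prev_state)
        clips.append(clip)
    return clips

def toy_video_model(scene, prev_state):
    """Stand-in for a call to a text-to-video backbone."""
    state = (prev_state or 0) + 1
    return f"clip[{scene.index}] cond_on={prev_state}", state

dialogue = ["Hello.", "We need to leave tonight.", "Then pack quickly."]
clips = director_agent(scripter_agent(dialogue), toy_video_model)
```

The key design point the sketch captures is that DirectorAgent threads state across scenes rather than generating each clip independently, which is what the paper credits for improved temporal coherence.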
Problem

Research questions and friction points this paper is trying to address.

long-horizon video generation
dialogue-to-video
semantic gap
cinematic coherence
narrative video synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

agentic framework
dialogue-to-video generation
cinematic script generation
long-horizon coherence
Visual-Script Alignment
Chenyu Mu
Tencent Hunyuan Multimodal Department
Xin He
Tencent Hunyuan Multimodal Department
Qu Yang
National University of Singapore
Deep Learning · Spiking Neural Network · Neuromorphic Computing
Wanshun Chen
Tencent Hunyuan Multimodal Department
Jiadi Yao
Tencent Hunyuan Multimodal Department
Huang Liu
Tencent Hunyuan Multimodal Department
Zihao Yi
Tencent Hunyuan Multimodal Department
Bo Zhao
Tencent Hunyuan Multimodal Department
Xingyu Chen
Tencent Hunyuan Multimodal Department
Ruotian Ma
Tencent Hunyuan Multimodal Department
F. Ye
Tencent Hunyuan Multimodal Department
Erkun Yang
Xidian University
Cheng Deng
University of Edinburgh
On-device LLM · NLP · GeoAI
Zhaopeng Tu
Tech Lead @ Tencent Digital Human
Digital Human · Agents · Large Language Models · Machine Translation
Xiaolong Li
Tencent Group, Alibaba Group, Ant Group, Microsoft
LLM · Digital Human · NLP/Chatbots · Deep Learning · Speech & Audio Processing
Linus
Tencent Hunyuan Multimodal Department