The Script is All You Need: An Agentic Framework for Long-Horizon Dialogue-to-Cinematic Video Generation

📅 2026-01-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video generation models struggle to produce long-duration, narratively coherent cinematic content from high-level dialogue due to a semantic gap between creative ideation and visual execution. This work proposes the first script-based dual-agent framework: ScripterAgent transforms raw dialogue into structured screenplays, while DirectorAgent orchestrates state-of-the-art video models to enable cross-scene sequential generation, ensuring narrative consistency. The contributions include ScriptBench—a multimodal screenplay benchmark—along with a Visual-Script Alignment metric and a CriticAgent-based automated evaluation system. Experiments demonstrate that the proposed framework significantly improves script fidelity and temporal coherence of generated videos, with consistent gains validated across multiple mainstream models, thereby establishing a new paradigm for automated film production.

📝 Abstract
Recent advances in video generation have produced models capable of synthesizing stunning visual content from simple text prompts. However, these models struggle to generate long-form, coherent narratives from high-level concepts like dialogue, revealing a "semantic gap" between a creative idea and its cinematic execution. To bridge this gap, we introduce a novel, end-to-end agentic framework for dialogue-to-cinematic-video generation. Central to our framework is ScripterAgent, a model trained to translate coarse dialogue into a fine-grained, executable cinematic script. To enable this, we construct ScriptBench, a new large-scale benchmark with rich multimodal context, annotated via an expert-guided pipeline. The generated script then guides DirectorAgent, which orchestrates state-of-the-art video models using a cross-scene continuous generation strategy to ensure long-horizon coherence. Our comprehensive evaluation, featuring an AI-powered CriticAgent and a new Visual-Script Alignment (VSA) metric, shows our framework significantly improves script faithfulness and temporal fidelity across all tested video models. Furthermore, our analysis uncovers a crucial trade-off in current SOTA models between visual spectacle and strict script adherence, providing valuable insights for the future of automated filmmaking.
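The dual-agent pipeline the abstract describes (dialogue → ScripterAgent → structured script → DirectorAgent → per-scene video generation, with each scene conditioned on the previous one for cross-scene continuity) can be sketched as follows. All class names, function signatures, and the toy video model here are hypothetical illustrations of the workflow, not the authors' actual interfaces.

```python
from dataclasses import dataclass

# Hypothetical structures illustrating the dual-agent workflow;
# the paper's real interfaces are not described at this level of detail.

@dataclass
class Scene:
    index: int
    shot_description: str  # fine-grained, executable instruction
    dialogue: str

@dataclass
class Script:
    scenes: list

def scripter_agent(raw_dialogue):
    """Translate coarse dialogue turns into a structured screenplay (stub)."""
    scenes = [
        Scene(index=i, shot_description=f"Shot for turn {i}", dialogue=turn)
        for i, turn in enumerate(raw_dialogue)
    ]
    return Script(scenes=scenes)

def director_agent(script, video_model):
    """Cross-scene continuous generation: condition each scene on the
    previous clip's ending state to preserve long-horizon coherence."""
    clips, prev_state = [], None
    for scene in script.scenes:
        clip, prev_state = video_model(scene, prev_state)
        clips.append(clip)
    return clips

def toy_video_model(scene, prev_state):
    """Stand-in for a call to a text-to-video backbone."""
    state = (prev_state or 0) + 1
    return f"clip[{scene.index}] cond_on={prev_state}", state

dialogue = ["Hello.", "We need to leave tonight.", "Then pack quickly."]
clips = director_agent(scripter_agent(dialogue), toy_video_model)
```

The key design point the sketch captures is that DirectorAgent threads state across scenes rather than generating each clip independently, which is what the paper credits for improved temporal coherence.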
Problem

Research questions and friction points this paper is trying to address.

long-horizon video generation
dialogue-to-video
semantic gap
cinematic coherence
narrative video synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

agentic framework
dialogue-to-video generation
cinematic script generation
long-horizon coherence
Visual-Script Alignment
Chenyu Mu
Tencent Hunyuan Multimodal Department
Xin He
Tencent Hunyuan Multimodal Department
Qu Yang
National University of Singapore
Deep Learning · Spiking Neural Network · Neuromorphic Computing
Wanshun Chen
Tencent Hunyuan Multimodal Department
Jiadi Yao
Tencent Hunyuan Multimodal Department
Huang Liu
Tencent Hunyuan Multimodal Department
Zihao Yi
Tencent Hunyuan Multimodal Department
Bo Zhao
Tencent Hunyuan Multimodal Department
Xingyu Chen
Tencent Hunyuan Multimodal Department
Ruotian Ma
Tencent Hunyuan Multimodal Department
F. Ye
Tencent Hunyuan Multimodal Department
Erkun Yang
Xidian University
Cheng Deng
University of Edinburgh
On-device LLM · NLP · GeoAI
Zhaopeng Tu
Tech Lead @ Tencent Digital Human
Digital Human · Agents · Large Language Models · Machine Translation
Xiaolong Li
Tencent Group, Alibaba Group, Ant Group, Microsoft
LLM · Digital Human · NLP/Chatbots · Deep Learning · Speech & Audio Processing
Linus
Tencent Hunyuan Multimodal Department