MCSC-Bench: Multimodal Context-to-Script Creation for Realistic Video Production

📅 2026-04-16

🤖 AI Summary
Existing video generation benchmarks focus on isolated subtasks and do not evaluate the reasoning needed to turn noisy, redundant multimodal materials into coherent, executable scripts. This work introduces a new task, Multimodal Context-to-Script Creation (MCSC), and presents MCSC-Bench, the first large-scale dataset for the task, with more than 11K annotated videos. The benchmark supports evaluation of material selection, narrative planning, and conditioned script generation, on both in-domain and out-of-domain test sets. Experiments show that current multimodal LLMs struggle with structure-aware reasoning under long contexts. An 8B-parameter model trained on MCSC-Bench surpasses Gemini-2.5-Pro and generalizes to out-of-domain scenarios, and downstream video generation guided by the generated scripts supports the practical value of the task.

📝 Abstract
Real-world video creation often involves a complex reasoning workflow of selecting relevant shots from noisy materials, planning missing shots for narrative completeness, and organizing them into coherent storylines. However, existing benchmarks focus on isolated sub-tasks and lack support for evaluating this full process. To address this gap, we propose Multimodal Context-to-Script Creation (MCSC), a new task that transforms noisy multimodal inputs and user instructions into structured, executable video scripts. We further introduce MCSC-Bench, the first large-scale MCSC dataset, comprising 11K+ well-annotated videos. Each sample includes: (1) redundant multimodal materials and user instructions; (2) a coherent, production-ready script containing material-based shots, newly planned shots (with shooting instructions), and shot-aligned voiceovers. MCSC-Bench supports comprehensive evaluation across material selection, narrative planning, and conditioned script generation, and includes both in-domain and out-of-domain test sets. Experiments show that current multimodal LLMs struggle with structure-aware reasoning under long contexts, highlighting the challenges posed by our benchmark. Models trained on MCSC-Bench achieve SOTA performance, with an 8B model surpassing Gemini-2.5-Pro, and generalize to out-of-domain scenarios. Downstream video generation guided by the generated scripts further validates the practical value of MCSC. Datasets are available at: https://github.com/huanran-hu/MCSC.
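
The sample layout described in the abstract (materials plus instructions on the input side; material-based shots, newly planned shots, and shot-aligned voiceovers on the output side) can be pictured with a small data model. The following is only a sketch: every class, field, and function name here (`Material`, `Shot`, `Sample`, `validate`, `unused_materials`) is hypothetical and chosen for illustration, and the actual MCSC-Bench file format may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Material:
    """One input asset. Field names are assumptions, not the dataset schema."""
    material_id: str
    modality: str       # e.g. "video" or "image"
    description: str


@dataclass
class Shot:
    """One script entry: either reuses a material or is newly planned."""
    voiceover: str
    material_id: Optional[str] = None           # set for material-based shots
    shooting_instruction: Optional[str] = None  # set for newly planned shots

    @property
    def is_planned(self) -> bool:
        return self.material_id is None


@dataclass
class Sample:
    instruction: str
    materials: List[Material]
    script: List[Shot] = field(default_factory=list)


def validate(sample: Sample) -> List[str]:
    """Return structural problems in the script; an empty list means well-formed."""
    known = {m.material_id for m in sample.materials}
    problems = []
    for i, shot in enumerate(sample.script):
        if shot.is_planned:
            if not shot.shooting_instruction:
                problems.append(f"shot {i}: planned shot lacks a shooting instruction")
        elif shot.material_id not in known:
            problems.append(f"shot {i}: unknown material {shot.material_id!r}")
        if not shot.voiceover.strip():
            problems.append(f"shot {i}: missing voiceover")
    return problems


def unused_materials(sample: Sample) -> List[str]:
    """IDs of provided materials that the script does not use (the redundant ones)."""
    used = {s.material_id for s in sample.script if s.material_id is not None}
    return [m.material_id for m in sample.materials if m.material_id not in used]


# Illustrative data only; not taken from the dataset.
example = Sample(
    instruction="Make a short product teaser.",
    materials=[
        Material("m1", "video", "close-up of the product"),
        Material("m2", "video", "unrelated street footage"),
    ],
    script=[
        Shot(voiceover="Meet the new design.", material_id="m1"),
        Shot(voiceover="Built for everyday use.",
             shooting_instruction="Film a person using the product at a desk."),
    ],
)
```

In this sketch, `validate(example)` returns an empty list and `unused_materials(example)` returns the one material the script leaves out, which mirrors the selection step the task asks a model to perform over redundant inputs.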

Problem

Research questions and friction points this paper is trying to address.

video production
multimodal reasoning
script generation
narrative planning
benchmark evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Reasoning
Video Script Generation
Narrative Planning
Long-context Understanding
Structured Video Production