OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video

πŸ“… 2026-04-13
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF

career value

212K/year
πŸ€– AI Summary
This work addresses the challenging task of automatically converting long-form movie videos into structured screenplays, which remains largely unexplored. We introduce the Video-to-Screenplay (V2S) task, establish the first human-annotated benchmark for long videos with hierarchical screenplay annotations, and propose OmniScriptβ€”a parameter-efficient multimodal language model that integrates 8B-parameter audiovisual representations. Our approach combines chain-of-thought supervised fine-tuning with a reinforcement learning strategy employing temporally segmented rewards. Through progressive training, OmniScript significantly outperforms existing open-source large models in temporal alignment and cross-domain semantic accuracy, achieving performance on par with state-of-the-art closed-source systems such as Gemini 3-Pro. Additionally, we present the first temporally aware hierarchical evaluation framework tailored for screenplay generation from video.

Technology Category

Application Category

πŸ“ Abstract
Current multimodal large language models (MLLMs) have demonstrated remarkable capabilities in short-form video understanding, yet translating long-form cinematic videos into detailed, temporally grounded scripts remains a significant challenge. This paper introduces the novel video-to-script (V2S) task, aiming to generate hierarchical, scene-by-scene scripts encompassing character actions, dialogues, expressions, and audio cues. To facilitate this, we construct a first-of-its-kind human-annotated benchmark and propose a temporally-aware hierarchical evaluation framework. Furthermore, we present OmniScript, an 8B-parameter omni-modal (audio-visual) language model tailored for long-form narrative comprehension. OmniScript is trained via a progressive pipeline that leverages chain-of-thought supervised fine-tuning for plot and character reasoning, followed by reinforcement learning using temporally segmented rewards. Extensive experiments demonstrate that despite its parameter efficiency, OmniScript significantly outperforms larger open-source models and achieves performance comparable to state-of-the-art proprietary models, including Gemini 3-Pro, in both temporal localization and multi-field semantic accuracy.
Problem

Research questions and friction points this paper is trying to address.

video-to-script
long-form cinematic video
temporally grounded script
audio-visual understanding
hierarchical script generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

video-to-script
omni-modal language model
temporal grounding
hierarchical script generation
reinforcement learning with segmented rewards
πŸ”Ž Similar Papers