CML-Bench: A Framework for Evaluating and Enhancing LLM-Powered Movie Scripts Generation

📅 2025-10-01
🤖 AI Summary
Existing large language models (LLMs) struggle to simultaneously achieve narrative depth and structural fidelity in screenplay generation, particularly exhibiting deficiencies in dialogue coherence (DC), character consistency (CC), and plot reasonableness (PR). To address this, we propose CML-Bench—the first multidimensional evaluation framework specifically designed for screenplay generation—and introduce the high-quality benchmark dataset CML-Dataset. We innovatively formalize screenplays using Cinematic Markup Language (CML), enabling structured representation; integrate multi-shot continuity analysis with rigorous human validation; and design the CML-Instruction prompting strategy to enhance cinematic quality (e.g., visual expressiveness, pacing, and dramatic tension). Experimental results demonstrate that CML-Bench effectively discriminates between human-written and LLM-generated screenplays, and when combined with CML-Instruction, LLM outputs achieve a 32.7% improvement in human preference scores across cinematic quality dimensions.

📝 Abstract
Large Language Models (LLMs) have demonstrated remarkable proficiency in generating highly structured texts. However, while exhibiting a high degree of structural organization, movie scripts demand an additional layer of nuanced storytelling and emotional depth, the 'soul' of compelling cinema, that LLMs often fail to capture. To investigate this deficiency, we first curated CML-Dataset, a dataset comprising (summary, content) pairs for Cinematic Markup Language (CML), where 'content' consists of segments from esteemed, high-quality movie scripts and 'summary' is a concise description of the content. Through an in-depth analysis of the intrinsic multi-shot continuity and narrative structures within these authentic scripts, we identified three pivotal dimensions for quality assessment: Dialogue Coherence (DC), Character Consistency (CC), and Plot Reasonableness (PR). Informed by these findings, we propose the CML-Bench, featuring quantitative metrics across these dimensions. CML-Bench effectively assigns high scores to well-crafted, human-written scripts while concurrently pinpointing the weaknesses in screenplays generated by LLMs. To further validate our benchmark, we introduce CML-Instruction, a prompting strategy with detailed instructions on character dialogue and event logic, to guide LLMs to generate more structured and cinematically sound scripts. Extensive experiments validate the effectiveness of our benchmark and demonstrate that LLMs guided by CML-Instruction generate higher-quality screenplays, with results aligned with human preferences.
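As a rough illustration of the data layout the abstract describes, a CML-Dataset entry pairs a concise summary with CML-formatted script content and can be scored along the three quality dimensions. The sketch below is hypothetical: the paper does not publish its schema, and the CML tag names, field names, and score values here are invented for illustration only.

```python
from dataclasses import dataclass, field

@dataclass
class CMLEntry:
    """One (summary, content) pair in the style of CML-Dataset.

    Field names and the CML tag syntax are illustrative assumptions,
    not the paper's actual schema.
    """
    summary: str   # concise description of the script segment
    content: str   # CML-formatted excerpt from a movie script
    scores: dict = field(default_factory=dict)  # per-dimension metric values

entry = CMLEntry(
    summary="Two detectives argue over a suspect's alibi in a rain-soaked alley.",
    content=(
        "<scene loc='ALLEY' time='NIGHT'>\n"
        "  <action>Rain hammers the dumpsters.</action>\n"
        "  <dialogue character='REYES'>His alibi doesn't hold.</dialogue>\n"
        "  <dialogue character='CHO'>Then why does the camera say otherwise?</dialogue>\n"
        "</scene>"
    ),
)
# Scored along the paper's three dimensions: Dialogue Coherence (DC),
# Character Consistency (CC), Plot Reasonableness (PR). Values are made up.
entry.scores = {"DC": 0.91, "CC": 0.88, "PR": 0.85}
print(sorted(entry.scores))  # ['CC', 'DC', 'PR']
```

Keeping summary, content, and per-dimension scores together in one record makes it straightforward to contrast human-written and LLM-generated segments on the same metrics.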
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM-generated movie scripts' storytelling and emotional depth
Assessing dialogue coherence, character consistency, and plot reasonableness
Developing benchmarks and instructions to improve cinematic script quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposed CML-Bench framework with quantitative evaluation metrics
Introduced CML-Instruction prompting strategy for structured scripts
Guided LLMs to generate higher-quality, cinematically sound screenplays
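The CML-Instruction idea, steering an LLM with explicit instructions on character dialogue and event logic, could be sketched as a prompt builder like the following. The instruction wording and the function name are invented for illustration; the paper's actual prompts are not reproduced here.

```python
def build_cml_instruction_prompt(summary: str) -> str:
    """Assemble a screenplay-generation prompt in the spirit of CML-Instruction.

    The rule text below is a hypothetical stand-in, not the paper's prompt.
    """
    rules = [
        "Write the screenplay in Cinematic Markup Language (CML): tag every "
        "scene heading, action line, and dialogue turn explicitly.",
        "Dialogue Coherence: each line must respond to the previous turn.",
        "Character Consistency: keep each character's voice and goals stable.",
        "Plot Reasonableness: every event needs a cause established earlier.",
    ]
    numbered = "\n".join(f"{i}. {rule}" for i, rule in enumerate(rules, 1))
    return f"Summary:\n{summary}\n\nFollow these rules:\n{numbered}"

prompt = build_cml_instruction_prompt("A heist goes wrong at a museum gala.")
print(prompt.splitlines()[0])  # Summary:
```

Structuring the instructions around the benchmark's own dimensions (DC, CC, PR) is what lets the same framework both score outputs and guide generation.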