Comparative Study of Large Language Models on Chinese Film Script Continuation: An Empirical Analysis Based on GPT-5.2 and Qwen-Max

πŸ“… 2026-01-21
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This study evaluates the performance of large language models (LLMs) on the culturally specific narrative task of continuing Chinese film scripts. To this end, we construct the first reproducible benchmark dataset for Chinese creative writing, comprising 303 valid samples derived from the β€œfirst-half-to-second-half” continuation paradigm across 53 classic films. We propose a multidimensional evaluation framework that integrates ROUGE-L, structural similarity metrics, and an LLM-as-Judge mechanism powered by DeepSeek-Reasoner, supplemented with paired statistical analysis. Experimental results demonstrate that GPT-5.2 significantly outperforms Qwen-Max in structural coherence, character consistency, stylistic alignment, formatting adherence, and overall quality, while Qwen-Max exhibits weaker generation stability. This work establishes a new benchmark and evaluation paradigm for narrative generation in Chinese.
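The summary names the metric stack but not its mechanics. As a minimal sketch (not the paper's released code), the snippet below shows how a character-level ROUGE-L and a weighted composite over the three dimensions could be computed; the character-level tokenization for Chinese text and the equal weighting of the three dimensions are assumptions of this illustration.

```python
# Minimal sketch of the evaluation described above.
# Assumptions (not specified in the paper): Chinese text is tokenized at the
# character level for ROUGE-L, and the composite score weights the three
# dimensions equally.

def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(reference: str, candidate: str) -> float:
    """Character-level ROUGE-L F1 (character tokenization is an assumption)."""
    ref, cand = list(reference), list(candidate)
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

def composite_score(rouge_l: float, structural_sim: float, judge_score_0_100: float,
                    weights: tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3)) -> float:
    """Hypothetical composite: weighted mean of the three normalized dimensions."""
    w1, w2, w3 = weights
    return w1 * rouge_l + w2 * structural_sim + w3 * (judge_score_0_100 / 100.0)
```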

πŸ“ Abstract
As large language models (LLMs) are increasingly applied to creative writing, their performance on culturally specific narrative tasks warrants systematic investigation. This study constructs the first Chinese film script continuation benchmark comprising 53 classic films, and designs a multi-dimensional evaluation framework comparing GPT-5.2 and Qwen-Max-Latest. Using a "first half to second half" continuation paradigm with 3 samples per film, we obtained 303 valid samples (GPT-5.2: 157, 98.7% validity; Qwen-Max: 146, 91.8% validity). Evaluation integrates ROUGE-L, structural similarity, and LLM-as-Judge scoring (DeepSeek-Reasoner). Statistical analysis of 144 paired samples reveals that Qwen-Max achieves marginally higher ROUGE-L (0.2230 vs 0.2114, d=-0.43); however, GPT-5.2 significantly outperforms in structural preservation (0.93 vs 0.75, d=0.46), overall quality (44.79 vs 25.72, d=1.04), and composite scores (0.50 vs 0.39, d=0.84). The effect size for overall quality reaches the large-effect threshold (d>0.8). GPT-5.2 excels in character consistency, tone-style matching, and format preservation, while Qwen-Max shows deficiencies in generation stability. This study provides a reproducible framework for LLM evaluation in Chinese creative writing.
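The abstract reports Cohen's d alongside significance for 144 paired samples. Below is a minimal sketch of how such a paired comparison could be run; computing d as the mean paired difference divided by the standard deviation of the differences is an assumption of this illustration, and the scores used are synthetic, not the paper's data.

```python
# Sketch of a paired statistical comparison between two models' per-sample
# scores, assuming d is a difference-based Cohen's d for paired samples.
import numpy as np
from scipy import stats

def paired_comparison(scores_a: np.ndarray, scores_b: np.ndarray):
    """Paired t-test and Cohen's d computed on the per-sample differences."""
    diff = scores_a - scores_b
    t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
    cohens_d = diff.mean() / diff.std(ddof=1)  # |d| > 0.8 is conventionally a large effect
    return t_stat, p_value, cohens_d

# Illustrative usage with synthetic overall-quality scores for 144 paired samples.
rng = np.random.default_rng(0)
model_a = rng.normal(44.8, 15.0, size=144)
model_b = rng.normal(25.7, 15.0, size=144)
print(paired_comparison(model_a, model_b))
```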
Problem

Research questions and friction points this paper is trying to address.

large language models
Chinese film script continuation
creative writing
culturally specific narrative
LLM evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chinese film script continuation
large language models
multi-dimensional evaluation
LLM-as-Judge
benchmark dataset
Yuxuan Cao
Hong Kong University of Science and Technology
data mining Β· llm Β· llm reasoning
Zida Yang
Guanghua School of Management, Peking University
Ye Wang
School of Journalism and Communication, Wuhan University