AI Summary
This study evaluates the performance of large language models (LLMs) on the culturally specific narrative task of continuing Chinese film scripts. To this end, we construct the first reproducible benchmark dataset for Chinese creative writing, comprising 303 valid samples derived from the "first-half-to-second-half" continuation paradigm across 53 classic films. We propose a multidimensional evaluation framework that integrates ROUGE-L, structural similarity metrics, and an LLM-as-Judge mechanism powered by DeepSeek-Reasoner, supplemented with paired statistical analysis. Experimental results demonstrate that GPT-5.2 significantly outperforms Qwen-Max in structural coherence, character consistency, stylistic alignment, formatting adherence, and overall quality, while Qwen-Max exhibits weaker generation stability. This work establishes a new benchmark and evaluation paradigm for narrative generation in Chinese.
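ROUGE-L, one of the automatic metrics in the framework above, scores a generated continuation against the reference script via their longest common subsequence (LCS) of tokens. A minimal token-level sketch of the standard ROUGE-L F-measure follows; it is illustrative only and is not the paper's implementation (token lists and the `beta` weighting are assumptions):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            # Extend the LCS on a match, otherwise carry the best prefix score.
            dp[i + 1][j + 1] = dp[i][j] + 1 if a[i] == b[j] else max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n]

def rouge_l(candidate, reference, beta=1.0):
    """ROUGE-L F-measure: harmonic combination of LCS precision and recall."""
    lcs = lcs_len(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)
```

For Chinese text, tokenization choice (character-level vs. word-level segmentation) substantially affects the resulting scores, so it must be held fixed across compared models.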
Abstract
As large language models (LLMs) are increasingly applied to creative writing, their performance on culturally specific narrative tasks warrants systematic investigation. This study constructs the first Chinese film script continuation benchmark comprising 53 classic films, and designs a multi-dimensional evaluation framework comparing GPT-5.2 and Qwen-Max-Latest. Using a "first half to second half" continuation paradigm with 3 samples per film, we obtained 303 valid samples (GPT-5.2: 157, 98.7% validity; Qwen-Max: 146, 91.8% validity). Evaluation integrates ROUGE-L, structural similarity, and LLM-as-Judge scoring (DeepSeek-Reasoner). Statistical analysis of 144 paired samples reveals that Qwen-Max achieves marginally higher ROUGE-L (0.2230 vs 0.2114, d=-0.43); however, GPT-5.2 significantly outperforms in structural preservation (0.93 vs 0.75, d=0.46), overall quality (44.79 vs 25.72, d=1.04), and composite scores (0.50 vs 0.39, d=0.84). The overall-quality effect size exceeds the conventional large-effect threshold (d > 0.8). GPT-5.2 excels in character consistency, tone-style matching, and format preservation, while Qwen-Max shows deficiencies in generation stability. This study provides a reproducible framework for LLM evaluation in Chinese creative writing.