AI Summary
This study evaluates the performance of large language models (LLMs) on the culturally specific narrative task of continuing Chinese film scripts. To this end, we construct the first reproducible benchmark dataset for Chinese creative writing, comprising 303 valid samples derived from the "first-half-to-second-half" continuation paradigm across 53 classic films. We propose a multidimensional evaluation framework that integrates ROUGE-L, structural similarity metrics, and an LLM-as-Judge mechanism powered by DeepSeek-Reasoner, supplemented with paired statistical analysis. Experimental results demonstrate that GPT-5.2 significantly outperforms Qwen-Max in structural coherence, character consistency, stylistic alignment, formatting adherence, and overall quality, while Qwen-Max exhibits weaker generation stability. This work establishes a new benchmark and evaluation paradigm for narrative generation in Chinese.
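ROUGE-L, one of the automatic metrics in the framework above, scores a generated continuation against the reference script via their longest common subsequence (LCS) of tokens. A minimal token-level sketch of the standard ROUGE-L F-measure follows; it is illustrative only and is not the paper's implementation (token lists and the `beta` weighting are assumptions):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            # Extend the LCS on a match, otherwise carry the best prefix score.
            dp[i + 1][j + 1] = dp[i][j] + 1 if a[i] == b[j] else max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n]

def rouge_l(candidate, reference, beta=1.0):
    """ROUGE-L F-measure: harmonic combination of LCS precision and recall."""
    lcs = lcs_len(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)
```

For Chinese text, tokenization choice (character-level vs. word-level segmentation) substantially affects the resulting scores, so it must be held fixed across compared models.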
Abstract
As large language models (LLMs) are increasingly applied to creative writing, their performance on culturally specific narrative tasks warrants systematic investigation. This study constructs the first Chinese film script continuation benchmark comprising 53 classic films, and designs a multi-dimensional evaluation framework comparing GPT-5.2 and Qwen-Max-Latest. Using a "first half to second half" continuation paradigm with 3 samples per film, we obtained 303 valid samples (GPT-5.2: 157, 98.7% validity; Qwen-Max: 146, 91.8% validity). Evaluation integrates ROUGE-L, structural similarity, and LLM-as-Judge scoring (DeepSeek-Reasoner). Statistical analysis of 144 paired samples reveals that Qwen-Max achieves marginally higher ROUGE-L (0.2230 vs 0.2114, d=-0.43); however, GPT-5.2 significantly outperforms in structural preservation (0.93 vs 0.75, d=0.46), overall quality (44.79 vs 25.72, d=1.04), and composite scores (0.50 vs 0.39, d=0.84). The overall-quality effect size exceeds the conventional large-effect threshold (d > 0.8). GPT-5.2 excels in character consistency, tone-style matching, and format preservation, while Qwen-Max shows deficiencies in generation stability. This study provides a reproducible framework for LLM evaluation in Chinese creative writing.