🤖 AI Summary
Existing evaluations of large language model (LLM) creativity largely neglect non-English literary traditions, particularly culturally grounded rhetorical competence. Method: We construct a dataset of user-generated Persian literary texts spanning 20 thematic domains and propose the first systematic, culture-adapted creativity assessment framework for non-English contexts, which quantifies originality, fluency, flexibility, and elaboration. Drawing on the Torrance Tests of Creative Thinking, we adapt and validate an automated LLM-based scoring mechanism that achieves high inter-rater reliability with human annotators (ICC > 0.85), substantially reducing evaluation cost. Results: Empirical analysis reveals LLMs’ strengths in deploying core rhetorical devices (e.g., simile, metaphor, hyperbole, antithesis), yet exposes persistent bottlenecks in culturally grounded expression. Our framework provides both empirical evidence and methodological infrastructure to guide the development and refinement of cross-lingual literary generation models.
📝 Abstract
Large language models (LLMs) have demonstrated notable creative abilities in generating literary texts, including poetry and short stories. However, prior research has centered primarily on English, with limited exploration of non-English literary traditions and no standardized methods for assessing creativity. In this paper, we evaluate the capacity of LLMs to generate Persian literary text enriched with culturally relevant expressions. We build a dataset of user-generated Persian literary texts spanning 20 diverse topics and assess model outputs along four creativity dimensions (originality, fluency, flexibility, and elaboration) by adapting the Torrance Tests of Creative Thinking. To reduce evaluation costs, we adopt an LLM as a judge for automated scoring and validate its reliability against human judgments using intraclass correlation coefficients, observing strong agreement. In addition, we analyze the models' ability to understand and employ four core literary devices: simile, metaphor, hyperbole, and antithesis. Our results highlight both the strengths and limitations of LLMs in Persian literary text generation, underscoring the need for further refinement.
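To illustrate how agreement between an LLM judge and human annotators can be quantified with an intraclass correlation coefficient, the sketch below computes ICC(2,1) (two-way random effects, absolute agreement, single rater) in plain Python. The abstract does not state which ICC form was used, so this choice is an assumption, and the function name `icc2_1` and the scores in the usage example are hypothetical, not data from the paper.

```python
from typing import Sequence


def icc2_1(scores: Sequence[Sequence[float]]) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    scores[i][j] is the score that rater j assigned to text i.
    Assumed ICC form; the paper may have used a different variant.
    """
    n = len(scores)      # number of rated texts
    k = len(scores[0])   # number of raters (e.g. humans plus the LLM judge)
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need a complete n x k table with n >= 2 and k >= 2")

    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA sums of squares: texts (rows), raters (columns), residual.
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    denom = ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    if denom == 0:
        raise ValueError("ICC is undefined when all scores are identical")
    return (ms_rows - ms_error) / denom


# Hypothetical originality scores for five texts: [human_1, human_2, llm_judge]
example_scores = [
    [4, 5, 4],
    [2, 2, 3],
    [5, 5, 5],
    [3, 4, 3],
    [1, 2, 2],
]
print(f"ICC(2,1) = {icc2_1(example_scores):.3f}")
```

Values close to 1 indicate that the raters assign nearly the same scores to each text, while values near 0 indicate that disagreement between raters is as large as the variation between texts.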