🤖 AI Summary
This study investigates the relationship between model scale and short-story generation quality, challenging the assumption that larger language models inherently produce superior creative writing. Method: We comparatively evaluate stories from a fine-tuned BART-large, GPT-3.5, GPT-4o, and human authors across grammatical correctness, narrative quality, creativity, and engagement. The methodology combines supervised fine-tuning, crowdsourced human evaluation (N=68), and qualitative linguistic analysis—covering coherence, cliché frequency, and the unexpectedness of semantic associations. Contribution/Results: Our empirical findings show that a carefully fine-tuned BART-large achieves a composite score of 2.11, exceeding the human baseline (1.85; a 14% relative improvement) and outperforming GPT-4o in expressing surprise (15% vs. 3% of synopses). This indicates that compact models, when well trained, can exhibit stronger non-canonical associative capabilities than state-of-the-art large models. The results challenge the “bigger is better” paradigm and point to a fundamental trade-off among model scale, predictability, and creative originality.
📝 Abstract
In this paper, we evaluate the creative fiction writing abilities of a fine-tuned small language model (SLM), BART-large, and compare its performance with that of human writers and two large language models (LLMs): GPT-3.5 and GPT-4o. Our evaluation consists of two experiments: (i) a human study in which 68 participants rated short stories written by humans and the SLM on grammaticality, relevance, creativity, and attractiveness, and (ii) a qualitative linguistic analysis examining the textual characteristics of the stories produced by each model. In the first experiment, BART-large outscored the average human writer overall (2.11 vs. 1.85), a 14% relative improvement, though the slight human advantage in creativity was not statistically significant. In the second experiment, qualitative analysis showed that while GPT-4o demonstrated near-perfect coherence and used fewer clichéd phrases, it tended to produce more predictable language, with only 3% of its synopses featuring surprising associations (compared to 15% for BART-large). These findings highlight how model size and fine-tuning influence the balance between creativity, fluency, and coherence in creative writing tasks, and demonstrate that smaller models can, in certain contexts, rival both humans and larger models.