Small Language Models can Outperform Humans in Short Creative Writing: A Study Comparing SLMs with Humans and LLMs

📅 2024-09-17
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
This study investigates the relationship between model scale and short-story generation quality, challenging the assumption that larger language models inherently produce superior creative writing. Method: We comparatively evaluate a fine-tuned BART-large, GPT-3.5, GPT-4o, and human-authored stories on grammatical correctness, narrative quality, creativity, and engagement. The methodology combines supervised fine-tuning of BART-large, a crowdsourced human evaluation (N=68), and qualitative linguistic analysis covering coherence, cliché frequency, and the unexpectedness of semantic associations. Contribution/Results: Our empirical findings demonstrate that a carefully fine-tuned BART-large achieves a composite score of 2.11, significantly exceeding the human baseline (1.85; +14%) and outperforming GPT-4o in expressions of surprise (15% vs. 3%). This reveals that compact models, when optimally trained, can exhibit stronger non-canonical associative capabilities than state-of-the-art large models. The results challenge the "bigger is better" paradigm and establish a fundamental trade-off among model scale, predictability, and creative originality.

📝 Abstract
In this paper, we evaluate the creative fiction writing abilities of a fine-tuned small language model (SLM), BART-large, and compare its performance to human writers and two large language models (LLMs): GPT-3.5 and GPT-4o. Our evaluation consists of two experiments: (i) a human study in which 68 participants rated short stories from humans and the SLM on grammaticality, relevance, creativity, and attractiveness, and (ii) a qualitative linguistic analysis examining the textual characteristics of stories produced by each model. In the first experiment, BART-large outscored average human writers overall (2.11 vs. 1.85), a 14% relative improvement, though the slight human advantage in creativity was not statistically significant. In the second experiment, qualitative analysis showed that while GPT-4o demonstrated near-perfect coherence and used fewer cliché phrases, it tended to produce more predictable language, with only 3% of its synopses featuring surprising associations (compared to 15% for BART). These findings highlight how model size and fine-tuning influence the balance between creativity, fluency, and coherence in creative writing tasks, and demonstrate that smaller models can, in certain contexts, rival both humans and larger models.
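The composite scoring described above can be illustrated with a minimal sketch. The per-rater data below is hypothetical (the paper does not publish raw ratings, and the rating scale is an assumption here); the composite is taken as the mean over all raters and all four criteria, and the 14% figure is reproduced from the two reported overall scores:

```python
from statistics import mean

# Hypothetical per-rater scores on the four abstract criteria
# (grammaticality, relevance, creativity, attractiveness); the
# 0-3 scale is an assumption for illustration only.
bart_ratings = [
    {"grammaticality": 2.5, "relevance": 2.0, "creativity": 1.8, "attractiveness": 2.1},
    {"grammaticality": 2.3, "relevance": 2.2, "creativity": 1.9, "attractiveness": 2.0},
]

def composite(ratings):
    """Mean of every criterion score across all raters."""
    return mean(score for r in ratings for score in r.values())

print(round(composite(bart_ratings), 2))  # -> 2.1

# The reported relative improvement follows directly from the
# two overall scores given in the abstract (2.11 vs. 1.85):
bart_overall, human_overall = 2.11, 1.85
rel_improvement = (bart_overall - human_overall) / human_overall
print(f"{rel_improvement:.0%}")  # -> 14%
```

This makes explicit that the "+14%" in the abstract is the relative gap (2.11 − 1.85) / 1.85 ≈ 0.14, not an absolute difference in scale points.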
Problem

Research questions and friction points this paper is trying to address.

Language Models
Storytelling Quality
Human Comparison
Innovation

Methods, ideas, or system contributions that make the work stand out.

BART-large
Storytelling Quality
Unexpected Plot Twists