🤖 AI Summary
This work systematically evaluates the creative capabilities of large language models (LLMs) in generating five-sentence imaginative short stories, measuring novelty, surprise, diversity, and linguistic complexity against human writers. Methodologically, it introduces a large-scale benchmark spanning 60 mainstream LLMs and 60 human authors, with a multi-perspective assessment framework that integrates automated metrics (semantic divergence, perplexity, n-gram diversity), expert and non-expert human judgments, Turing-test-based classification, and LLM-based meta-evaluation. Results show that while LLMs surpass humans in linguistic complexity, they consistently underperform in novelty, surprise, and diversity. Expert ratings correlate strongly with the automated metrics, whereas both LLMs and non-expert raters systematically overestimate the creativity of LLM-generated stories. The study establishes a reproducible, multidimensional methodology for rigorously evaluating LLM creativity across both human and model subjects.
📝 Abstract
Story-writing is a fundamental aspect of human imagination, relying heavily on creativity to produce narratives that are novel, effective, and surprising. While large language models (LLMs) have demonstrated the ability to generate high-quality stories, their creative story-writing capabilities remain under-explored. In this work, we conduct a systematic analysis of creativity in short story generation across 60 LLMs and 60 people using a five-sentence creative story-writing task. We use automated measures to evaluate model- and human-generated stories across several dimensions of creativity, including novelty, surprise, diversity, and linguistic complexity. We also collect creativity ratings and Turing Test classifications from non-expert and expert human raters and from LLMs. Automated metrics show that LLMs generate stylistically complex stories but tend to fall short in terms of novelty, surprise, and diversity when compared to average human writers. Expert ratings generally coincide with the automated metrics. However, LLMs and non-experts rate LLM stories as more creative than human-generated stories. We discuss why and how these differences in ratings occur, and their implications for both human and artificial creativity.