🤖 AI Summary
This work systematically evaluates the creative capabilities of large language models (LLMs) in generating five-sentence imaginative short stories, measuring novelty, surprise, diversity, and linguistic complexity against human writers. Methodologically, it introduces a large-scale benchmark spanning 60 mainstream LLMs and 60 human authors, with a multi-perspective assessment framework that integrates automated metrics (semantic divergence, perplexity, n-gram diversity), expert and non-expert human judgments, Turing-test-based classification, and LLM-based meta-evaluation. Results show that while LLMs surpass humans in linguistic complexity, they consistently underperform in novelty, surprise, and diversity. Expert ratings correlate strongly with the automated metrics, whereas both LLMs and non-expert raters systematically overestimate the creativity of LLM-generated stories. The study establishes a reproducible, multidimensional methodology for rigorously evaluating LLM creativity across both human and model subjects.
📝 Abstract
Story-writing is a fundamental aspect of human imagination, relying heavily on creativity to produce narratives that are novel, effective, and surprising. While large language models (LLMs) have demonstrated the ability to generate high-quality stories, their creative story-writing capabilities remain under-explored. In this work, we conduct a systematic analysis of creativity in short story generation across 60 LLMs and 60 people using a five-sentence creative story-writing task. We use automated measures to evaluate model- and human-generated stories across several dimensions of creativity, including novelty, surprise, diversity, and linguistic complexity. We also collect creativity ratings and Turing Test classifications from non-expert and expert human raters and from LLMs. Automated metrics show that LLMs generate stylistically complex stories but tend to fall short in terms of novelty, surprise, and diversity when compared to average human writers. Expert ratings generally coincide with the automated metrics. However, LLMs and non-experts rate LLM stories as more creative than human-generated stories. We discuss why and how these differences in ratings occur, and their implications for both human and artificial creativity.