Echoes in AI: Quantifying Lack of Plot Diversity in LLM Outputs

📅 2024-12-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) exhibit insufficient plot diversity in story generation: GPT-4 and LLaMA-3 repeatedly produce semantically similar plot elements, indicating limited creative divergence. Method: We propose the *Sui Generis* score—the first metric to formalize plot uniqueness as a statistical anomaly within cross-sample semantic distributions—computed via story-segment embeddings, semantic clustering, and probabilistic resampling. Contribution/Results: Evaluated on 100 short stories, our automated framework reveals that LLM outputs are dominated by plot combinations echoed frequently across generations, whereas human-authored plots are rarely replicated. The *Sui Generis* score correlates moderately with human “surprise” ratings (r = 0.52, p < 0.01), supporting its perceptual relevance. This work establishes an interpretable, scalable, and quantifiable paradigm for assessing LLM creativity, grounded in distributional semantics and statistical novelty detection.

📝 Abstract
With rapid advances in large language models (LLMs), there has been an increasing application of LLMs in creative content ideation and generation. A critical question emerges: can current LLMs provide ideas that are diverse enough to truly bolster collective creativity? We examine two state-of-the-art LLMs, GPT-4 and LLaMA-3, on story generation and discover that LLM-generated stories often consist of plot elements that are echoed across many generations. To quantify this phenomenon, we introduce the Sui Generis score, which estimates how unlikely a plot element is to appear in alternative storylines generated by the same LLM. Evaluating 100 short stories, we find that LLM-generated stories often contain combinations of idiosyncratic plot elements echoed frequently across generations, while the original human-written stories are rarely recreated or even echoed in pieces. Moreover, our human evaluation shows that the ranking of Sui Generis scores among story segments correlates moderately with human judgments of surprise level, even though the score is computed fully automatically, without relying on human judgment.
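The paper's actual pipeline (story-segment embeddings, semantic clustering, probabilistic resampling) is not reproduced here; as a rough illustration of the core idea only, the sketch below scores a segment by how rarely semantically similar segments appear among alternative generations. It substitutes a toy bag-of-words vector for a real sentence encoder, and the function names, similarity threshold, and smoothing are illustrative assumptions, not the authors' implementation.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy stand-in for a sentence embedding: a bag-of-words count vector.
    # (The paper uses learned semantic embeddings; this is only a sketch.)
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def sui_generis(segment, alternative_segments, sim_threshold=0.5):
    """Score how unlikely `segment` is to be echoed among alternatives.

    Counts alternative segments whose similarity exceeds a threshold,
    estimates the probability of such an "echo" with Laplace smoothing
    (avoiding log(0)), and returns its negative log: higher = rarer.
    """
    seg_vec = embed(segment)
    echoes = sum(
        1 for alt in alternative_segments
        if cosine(seg_vec, embed(alt)) >= sim_threshold
    )
    p_echo = (echoes + 1) / (len(alternative_segments) + 2)
    return -math.log(p_echo)
```

In this toy setting, a plot beat paraphrased by several alternative storylines yields a low score, while a beat with no near-duplicates among the alternatives yields a high one, mirroring the intuition that human-written plots are "rarely recreated or even echoed in pieces."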
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Narrative Diversity
Creativity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sui Generis Scoring System
Creativity Diversity Assessment
AI Storytelling Evaluation