🤖 AI Summary
This study investigates higher-order limitations of large language models (LLMs) in generating high-quality narrative fiction, focusing on causal coherence, character intentionality, and dramatic conflict. Method: It introduces narrative planning theory from computational narratology into LLM evaluation for the first time, constructing the first multidimensional benchmark covering these dimensions. The benchmark uses test cases derived from canonical literary works and evaluates zero-shot and few-shot generations from models including GPT-4, validated via human annotation and automated metrics (e.g., causal chain completeness, intention consistency). Contribution/Results: State-of-the-art LLMs reliably produce causally coherent narratives only at short lengths; modeling character intentions and dynamically constructing dramatic conflict remain significant bottlenecks, and narrative quality degrades sharply as length increases. This work establishes a novel, scalable paradigm and benchmark for evaluating narrative AI capabilities.
📝 Abstract
Story generation has been a prominent application of Large Language Models (LLMs). However, understanding LLMs' ability to produce high-quality stories remains limited due to challenges in automatic evaluation methods and the high cost and subjectivity of manual evaluation. Computational narratology offers valuable insights into what constitutes a good story, which have been applied in the symbolic narrative planning approach to story generation. This work aims to deepen the understanding of LLMs' story generation capabilities by using them to solve narrative planning problems. We present a benchmark for evaluating LLMs on narrative planning based on literature examples, focusing on causal soundness, character intentionality, and dramatic conflict. Our experiments show that GPT-4-tier LLMs can generate causally sound stories at small scales, but planning with character intentionality and dramatic conflict remains challenging, requiring LLMs trained with reinforcement learning for complex reasoning. The results offer insight into the scale of stories that LLMs can generate while maintaining quality in different respects. Our findings also highlight interesting problem-solving behaviors and shed light on challenges and considerations for applying LLM narrative planning in game environments.