🤖 AI Summary
This paper identifies a fundamental flaw in n-gram novelty as an evaluation metric for textual creativity: it captures originality alone while neglecting appropriateness (i.e., coherence and utility), and thus fails to reflect creativity's dual nature. Method: Leveraging human annotations from 26 expert writers on 7,542 AI- and human-generated text expressions, the study combines close textual analysis with classification experiments using zero-shot, few-shot, and fine-tuned language models, and evaluates LLM-as-a-Judge for creativity scoring. Contribution/Results: Key findings include: (1) ~91% of expressions in the top quartile by n-gram novelty were not judged creative; (2) in open-weight LMs, higher n-gram novelty correlates with lower pragmaticality; (3) frontier closed-source LMs remain less likely than humans to produce creative expressions; and (4) the best automated scorers predict expert preferences yet struggle markedly to identify non-pragmatic expressions. This work provides empirical evidence that n-gram novelty alone is misleading as a creativity metric and motivates a two-dimensional "novelty-appropriateness" evaluation paradigm.
📝 Abstract
N-gram novelty is widely used to evaluate language models' ability to generate text outside of their training data. More recently, it has also been adopted as a metric for measuring textual creativity. However, theoretical work on creativity suggests that this approach may be inadequate, as it does not account for creativity's dual nature: novelty (how original the text is) and appropriateness (how sensical and pragmatic it is). We investigate the relationship between this notion of creativity and n-gram novelty through 7,542 expert writer annotations (n=26) of novelty, pragmaticality, and sensicality via close reading of human- and AI-generated text. We find that while n-gram novelty is positively associated with expert writer-judged creativity, ~91% of top-quartile expressions by n-gram novelty are not judged as creative, cautioning against relying on n-gram novelty alone. Furthermore, unlike in human-written text, higher n-gram novelty in open-source LLMs correlates with lower pragmaticality. In an exploratory study with frontier closed-source models, we additionally confirm that they are less likely than humans to produce creative expressions. Using our dataset, we test whether zero-shot, few-shot, and fine-tuned models are able to identify creative expressions (a positive aspect of writing) and non-pragmatic ones (a negative aspect). Overall, frontier LLMs perform much better than random but leave room for improvement, especially struggling to identify non-pragmatic expressions. We further find that LLM-as-a-Judge novelty scores from the best-performing model were predictive of expert writer preferences.
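For readers unfamiliar with the metric under study, the sketch below illustrates the standard way n-gram novelty is computed: the fraction of a text's n-grams that do not occur in a reference (training) corpus. The function names, whitespace tokenizer, and toy corpus are illustrative assumptions, not the paper's actual implementation.

```python
from typing import List, Set, Tuple

def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous n-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_novelty(text: str, reference_corpus: str, n: int = 3) -> float:
    """Fraction of the text's n-grams that are absent from the reference corpus.

    A score of 1.0 means every n-gram is unseen; 0.0 means all were copied.
    Whitespace tokenization stands in for a real tokenizer.
    """
    seen: Set[Tuple[str, ...]] = set(ngrams(reference_corpus.lower().split(), n))
    grams = ngrams(text.lower().split(), n)
    if not grams:
        return 0.0
    novel = sum(1 for g in grams if g not in seen)
    return novel / len(grams)

# Toy example: two of the four trigrams also appear in the reference corpus.
corpus = "the cat sat on the mat"
score = ngram_novelty("the cat sat on a hat", corpus, n=3)
```

As the paper's findings suggest, a high value of this score says nothing about whether the unseen n-grams are sensical or pragmatic, which is exactly the gap the annotation study probes.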