🤖 AI Summary
Existing LLM novelty evaluation suffers from an imbalanced originality–quality trade-off and unreliable human preferences. Method: We propose a joint originality–quality evaluation paradigm: originality is quantified as the proportion of n-grams unseen in training data, quality as a task-specific score, and novelty as their harmonic mean; we further introduce and model the originality–quality Pareto frontier. Contribution/Results: Evaluating OLMo and Pythia systematically on story completion, poetry writing, and creative tool use, we find that current open-data LLMs are significantly less novel than humans. Scaling and post-training shift the Pareto frontier rightward, whereas inference-time perturbations (e.g., temperature tuning) only enable trade-offs along the frontier. This work counters the misconception that high originality necessarily implies low quality and establishes a reproducible, decomposable benchmark framework for assessing LLM creativity.
📝 Abstract
As large language models (LLMs) are increasingly used for ideation and scientific discovery, it is important to evaluate their ability to generate novel output. Prior work evaluates novelty as originality with respect to training data, but original outputs can be low quality. Conversely, non-expert judges may favor high-quality but memorized outputs, limiting the reliability of human preference as a metric. We propose a new novelty metric for LLM generations that balances originality and quality -- the harmonic mean of the fraction of n-grams unseen during training and a task-specific quality score. We evaluate the novelty of generations from two families of open-data models (OLMo and Pythia) on three creative tasks: story completion, poetry writing, and creative tool use. We find that LLM-generated text is less novel than human-written text. To elicit more novel outputs, we experiment with various inference-time methods, which reveal a trade-off between originality and quality: these methods can boost novelty, but only by increasing originality at the expense of quality. In contrast, increasing model size or applying post-training reliably shifts the Pareto frontier, highlighting that starting with a stronger base model is a more effective way to improve novelty.
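The metric described above is straightforward to sketch. The snippet below is a minimal illustration, not the paper's implementation: the function names, the choice of bigrams (n=2), and the assumption that the quality score is normalized to [0, 1] are all ours.

```python
def unseen_ngram_fraction(tokens, training_ngrams, n=2):
    """Originality: fraction of the generation's n-grams absent from training data."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    return sum(g not in training_ngrams for g in grams) / len(grams)

def novelty(originality, quality):
    """Novelty: harmonic mean of originality and a quality score, both in [0, 1]."""
    if originality + quality == 0:
        return 0.0
    return 2 * originality * quality / (originality + quality)

# Toy example: two of the four bigrams in the generation are unseen.
training = {("once", "upon"), ("upon", "a"), ("a", "time")}
generation = ["once", "upon", "a", "midnight", "dreary"]
orig = unseen_ngram_fraction(generation, training)  # -> 0.5
```

The harmonic mean penalizes imbalance: a generation that is maximally original but scores zero on quality (or vice versa) receives zero novelty, which is what makes the metric resistant to both degenerate-but-original and memorized-but-fluent outputs.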