🤖 AI Summary
Existing LLM novelty evaluation suffers from an imbalanced originality–quality trade-off and unreliable human preferences. Method: We propose a joint originality–quality evaluation paradigm: originality is quantified as the proportion of n-grams unseen in training data, quality as a task-specific score, and novelty as their harmonic mean; we further introduce and model the originality–quality Pareto frontier. Contribution/Results: Evaluating OLMo and Pythia systematically on story completion, poetry writing, and creative tool use, we find that current open-data LLMs are significantly less novel than humans. Scaling and post-training shift the Pareto frontier rightward, whereas inference-time perturbations (e.g., temperature tuning) only enable trade-offs along the frontier. This work counters the misconception that high originality necessarily implies low quality and establishes a reproducible, decomposable benchmark framework for assessing LLM creativity.
📝 Abstract
As large language models (LLMs) are increasingly used for ideation and scientific discovery, it is important to evaluate their ability to generate novel output. Prior work evaluates novelty as originality with respect to training data, but original outputs can be low quality. Conversely, non-expert judges may favor high-quality but memorized outputs, limiting the reliability of human preference as a metric. We propose a new novelty metric for LLM generations that balances originality and quality -- the harmonic mean of the fraction of n-grams unseen during training and a task-specific quality score. We evaluate the novelty of generations from two families of open-data models (OLMo and Pythia) on three creative tasks: story completion, poetry writing, and creative tool use. We find that LLM-generated text is less novel than human-written text. To elicit more novel outputs, we experiment with various inference-time methods, which reveal a trade-off between originality and quality: these methods can boost novelty, but only by increasing originality at the expense of quality. In contrast, increasing model size or applying post-training reliably shifts the Pareto frontier, highlighting that starting with a stronger base model is a more effective way to improve novelty.
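The metric described above is straightforward to sketch. The snippet below is a minimal illustration, not the paper's implementation: the function names, the choice of bigrams (n=2), and the assumption that the quality score is normalized to [0, 1] are all ours.

```python
def unseen_ngram_fraction(tokens, training_ngrams, n=2):
    """Originality: fraction of the generation's n-grams absent from training data."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    return sum(g not in training_ngrams for g in grams) / len(grams)

def novelty(originality, quality):
    """Novelty: harmonic mean of originality and a quality score, both in [0, 1]."""
    if originality + quality == 0:
        return 0.0
    return 2 * originality * quality / (originality + quality)

# Toy example: two of the four bigrams in the generation are unseen.
training = {("once", "upon"), ("upon", "a"), ("a", "time")}
generation = ["once", "upon", "a", "midnight", "dreary"]
orig = unseen_ngram_fraction(generation, training)  # -> 0.5
```

The harmonic mean penalizes imbalance: a generation that is maximally original but scores zero on quality (or vice versa) receives zero novelty, which is what makes the metric resistant to both degenerate-but-original and memorized-but-fluent outputs.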