🤖 AI Summary
This work investigates the *n*-gram novelty of language model (LM) generations, i.e., the proportion of *n*-grams in generated text absent from the training corpus. To quantify this, we introduce *n*-novelty, a formal metric, and develop Rusty-DAWG: the first Rust implementation of a Directed Acyclic Word Graph (DAWG) enabling O(1) average-case *n*-gram lookup for arbitrary *n*. Using Pythia models and systematic *n*-gram frequency analysis, we empirically demonstrate two key findings: (1) for *n* > 4, LM outputs exhibit *lower* *n*-gram novelty than human-written text, contradicting intuition; and (2) training *n*-gram frequency correlates strongly and negatively with model completion loss. These results challenge assumptions about LM creativity and memorization. We publicly release Rusty-DAWG to support reproducible, scalable analysis of pretraining data provenance and LM memory behavior, providing both a novel methodological tool and empirical grounding for future research on LM generalization and memorization.
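To make the *n*-novelty metric concrete, here is a minimal set-based sketch in Python: the fraction of *n*-grams in a generation that never occur in the training corpus. The function names (`ngrams`, `n_novelty`) and the toy corpus are illustrative, not from the paper; the paper's actual analysis replaces the in-memory set with the Rusty-DAWG index so membership checks remain fast for arbitrary *n* at pretraining-corpus scale.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def n_novelty(generated, training, n):
    """Fraction of n-grams in `generated` that do not occur in `training`.

    Naive baseline: materialize all training n-grams in a set.
    (Rusty-DAWG avoids this by indexing the corpus once and answering
    membership queries in time independent of corpus size.)
    """
    seen = set(ngrams(training, n))
    gen = ngrams(generated, n)
    if not gen:
        return 0.0
    novel = sum(1 for g in gen if g not in seen)
    return novel / len(gen)

# Toy example: one of the five generated bigrams ("the rug") is unseen.
train = "the cat sat on the mat".split()
gen = "the cat sat on the rug".split()
print(n_novelty(gen, train, 2))  # 0.2
```

Note that with a fixed *n* this set-based approach is workable, but the paper's setting requires queries for *arbitrarily large* *n* over billions of tokens, which is what motivates the DAWG index.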
📝 Abstract
How novel are texts generated by language models (LMs) relative to their training corpora? In this work, we investigate the extent to which modern LMs generate n-grams from their training data, evaluating both (i) the probability LMs assign to complete training n-grams and (ii) n-novelty, the proportion of n-grams generated by an LM that did not appear in the training data (for arbitrarily large n). To enable arbitrary-length n-gram search over a corpus in constant time w.r.t. corpus size, we develop Rusty-DAWG, a novel search tool inspired by indexing of genomic data. We compare the novelty of LM-generated text to human-written text and explore factors that affect generation novelty, focusing on the Pythia models. We find that, for n > 4, LM-generated text is less novel than human-written text, though it is more novel for smaller n. Larger LMs and more constrained decoding strategies both decrease novelty. Finally, we show that LMs complete n-grams with lower loss if they are more frequent in the training data. Overall, our results reveal factors influencing the novelty of LM-generated text, and we release Rusty-DAWG to facilitate further pretraining data research.