🤖 AI Summary
This work investigates the *n*-gram novelty of language model (LM) generations, i.e., the proportion of *n*-grams in generated text absent from the training corpus. To quantify this, we introduce *n*-novelty, a formal metric, and develop Rusty-DAWG: the first Rust implementation of a Directed Acyclic Word Graph (DAWG) enabling O(1) average-case *n*-gram lookup for arbitrary *n*. Using Pythia models and systematic *n*-gram frequency analysis, we empirically demonstrate two key findings: (1) for *n* > 4, LM outputs exhibit *lower* *n*-gram novelty than human-written text, contradicting intuition; and (2) training *n*-gram frequency correlates strongly and negatively with model completion loss. These results challenge assumptions about LM creativity and memorization. We publicly release Rusty-DAWG to support reproducible, scalable analysis of pretraining data provenance and LM memory behavior, providing both a novel methodological tool and empirical grounding for future research on LM generalization and memorization.
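To make the *n*-novelty metric concrete, here is a minimal set-based sketch in Python: the fraction of *n*-grams in a generation that never occur in the training corpus. The function names (`ngrams`, `n_novelty`) and the toy corpus are illustrative, not from the paper; the paper's actual analysis replaces the in-memory set with the Rusty-DAWG index so membership checks remain fast for arbitrary *n* at pretraining-corpus scale.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def n_novelty(generated, training, n):
    """Fraction of n-grams in `generated` that do not occur in `training`.

    Naive baseline: materialize all training n-grams in a set.
    (Rusty-DAWG avoids this by indexing the corpus once and answering
    membership queries in time independent of corpus size.)
    """
    seen = set(ngrams(training, n))
    gen = ngrams(generated, n)
    if not gen:
        return 0.0
    novel = sum(1 for g in gen if g not in seen)
    return novel / len(gen)

# Toy example: one of the five generated bigrams ("the rug") is unseen.
train = "the cat sat on the mat".split()
gen = "the cat sat on the rug".split()
print(n_novelty(gen, train, 2))  # 0.2
```

Note that with a fixed *n* this set-based approach is workable, but the paper's setting requires queries for *arbitrarily large* *n* over billions of tokens, which is what motivates the DAWG index.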
📝 Abstract
How novel are texts generated by language models (LMs) relative to their training corpora? In this work, we investigate the extent to which modern LMs generate n-grams from their training data, evaluating both (i) the probability LMs assign to complete training n-grams and (ii) n-novelty, the proportion of n-grams generated by an LM that did not appear in the training data (for arbitrarily large n). To enable arbitrary-length n-gram search over a corpus in constant time w.r.t. corpus size, we develop Rusty-DAWG, a novel search tool inspired by indexing of genomic data. We compare the novelty of LM-generated text to human-written text and explore factors that affect generation novelty, focusing on the Pythia models. We find that, for n > 4, LM-generated text is less novel than human-written text, though it is more novel for smaller n. Larger LMs and more constrained decoding strategies both decrease novelty. Finally, we show that LMs complete n-grams with lower loss if they are more frequent in the training data. Overall, our results reveal factors influencing the novelty of LM-generated text, and we release Rusty-DAWG to facilitate further pretraining data research.