Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data

📅 2024-07-20
🏛️ arXiv.org
📈 Citations: 16
Influential: 1
🤖 AI Summary
This study investigates whether large language models' (LLMs) capabilities stem primarily from generalization or from memorization of pretraining data. Method: the authors propose "distributional memorization", a metric quantifying the correlation between model output probabilities and token frequencies in the pretraining corpus, and introduce a task-gram language model to estimate task-specific pretraining frequencies by counting co-occurrences of semantically related n-gram pairs drawn from task inputs and outputs. Using the Pythia model family trained on the Pile dataset, they assess the memorization–generalization trade-off across four tasks. Results: factual question answering relies most heavily on distributional memorization, whereas machine translation and mathematical reasoning depend predominantly on generalization. Scaling model size improves performance on all tasks but increases memorization only for factual question answering, suggesting that task type governs which mechanism dominates. The work offers a systematic disentanglement of memorization and generalization in LLMs and a quantifiable, task-adaptive framework for capability attribution.

📝 Abstract
The impressive capabilities of large language models (LLMs) have sparked debate over whether these models genuinely generalize to unseen tasks or predominantly rely on memorizing vast amounts of pretraining data. To explore this issue, we introduce an extended concept of memorization, distributional memorization, which measures the correlation between the LLM output probabilities and the pretraining data frequency. To effectively capture task-specific pretraining data frequency, we propose a novel task-gram language model, which is built by counting the co-occurrence of semantically related $n$-gram pairs from task inputs and outputs in the pretraining corpus. Using the Pythia models trained on the Pile dataset, we evaluate four distinct tasks: machine translation, factual question answering, world knowledge understanding, and math reasoning. Our findings reveal varying levels of memorization, with the strongest effect observed in factual question answering. Furthermore, while model performance improves across all tasks as LLM size increases, only factual question answering shows an increase in memorization, whereas machine translation and reasoning tasks exhibit greater generalization, producing more novel outputs. This study demonstrates that memorization plays a larger role in simpler, knowledge-intensive tasks, while generalization is the key for harder, reasoning-based tasks, providing a scalable method for analyzing large pretraining corpora in greater depth.
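The abstract defines distributional memorization as the correlation between LLM output probabilities and pretraining-data frequencies. A minimal sketch of that idea, using a hand-rolled Spearman rank correlation over toy numbers (the paper's actual frequencies come from n-gram counts over the Pile; all values below are hypothetical):

```python
# Sketch of "distributional memorization": rank-correlate a model's output
# probabilities with pretraining-corpus frequencies of the same spans.
# The data here is toy; only the correlation idea is from the paper.

def rank(values):
    """1-based average ranks, with ties sharing the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-example model probabilities and corpus frequency counts.
model_probs = [0.91, 0.40, 0.75, 0.10, 0.62]
corpus_freqs = [5400, 120, 1800, 15, 900]
memorization_score = spearman(model_probs, corpus_freqs)  # high = memorization
```

A score near 1 would indicate the model assigns high probability exactly where the corpus is frequent (memorization-like behavior); a score near 0 would indicate output probabilities are decoupled from corpus frequency (generalization-like behavior).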
Problem

Research questions and friction points this paper is trying to address.

Do LLMs genuinely generalize to unseen tasks, or do they mainly reproduce pretraining data?
How can memorization be measured distributionally, beyond verbatim copying of training text?
How does the memorization–generalization balance shift across task types and model scales?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduced distributional memorization concept
Developed task-gram language model
Analyzed memorization vs. generalization in LLMs
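The task-gram language model listed above is built by counting how often semantically related n-gram pairs from task inputs and outputs co-occur in the pretraining corpus. A toy sketch of that counting step (the document set, n-gram pairs, and window size are illustrative stand-ins; the paper computes these counts over the Pile):

```python
# Hypothetical sketch of task-gram counting: for each (input n-gram,
# output n-gram) pair, count co-occurrences within a token window in
# pretraining documents. Whitespace tokenization is a simplification.
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def task_gram_counts(docs, input_ngrams, output_ngrams, n=2, window=10):
    """Count (input n-gram, output n-gram) pairs where the output n-gram
    appears within `window` positions after the input n-gram in a doc."""
    counts = Counter()
    for doc in docs:
        grams = ngrams(doc.split(), n)
        positions = {}
        for i, g in enumerate(grams):
            positions.setdefault(g, []).append(i)
        for gi in input_ngrams:
            for go in output_ngrams:
                for pi in positions.get(gi, []):
                    for po in positions.get(go, []):
                        if 0 < po - pi <= window:
                            counts[(gi, go)] += 1
    return counts

# Toy corpus and task n-grams (e.g. a factual-QA input/output pair).
docs = ["the capital of france is paris and it is large"]
input_grams = [("capital", "of"), ("of", "france")]
output_grams = [("is", "paris")]
pair_counts = task_gram_counts(docs, input_grams, output_grams)
```

Normalizing such pair counts would yield the task-gram LM's estimate of how strongly the pretraining data supports a given task output, which is then correlated with the LLM's own output probabilities.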