How much can we forget about Data Contamination?

📅 2024-10-04
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates benchmark data contamination as a confounding factor in large language model (LLM) evaluation. To address whether and how models forget contaminated samples seen early in training, the study systematically varies contamination intensity and training scale along three axes (model parameters, training steps, and total tokens), grounded in the Chinchilla scaling law. Empirical validation comes from continued pretraining of OLMo-7B and a retrospective analysis of Llama 3 405B. The study provides a quantitative characterization of sample "forgettability" under multidimensional scaling. Key findings include: (i) contamination effects decay significantly once training exceeds the Chinchilla-optimal budget; (ii) even examples seen 144 times are effectively forgotten when the training-token count is scaled beyond five times the Chinchilla-optimal amount; and (iii) empirical forgetting proceeds faster than cumulative weight decay alone would predict. These results motivate a contamination-resilient evaluation paradigm for LLMs.

📝 Abstract
The leakage of benchmark data into the training data has emerged as a significant challenge for evaluating the capabilities of large language models (LLMs). In this work, we challenge the common assumption that small-scale contamination renders benchmark evaluations invalid. First, we experimentally quantify the magnitude of benchmark overfitting based on scaling along three dimensions: the number of model parameters (up to 1.6B), the number of times an example is seen (up to 144), and the number of training tokens (up to 40B). If model and data follow the Chinchilla scaling laws, minor contamination indeed leads to overfitting. At the same time, even 144 times of contamination can be forgotten if the training data is scaled beyond five times Chinchilla, a regime characteristic of many modern LLMs. Continual pre-training of OLMo-7B corroborates these results. Next, we study the impact of the weight decay parameter on example forgetting, showing that empirical forgetting occurs faster than the cumulative weight decay. This allows us to gauge the degree of example forgetting in large-scale training runs, indicating that many LLMs, including Llama 3 405B, have forgotten the data seen at the beginning of training.
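The abstract's "five times Chinchilla" threshold can be made concrete with a small sketch. The Chinchilla scaling law is commonly approximated as roughly 20 training tokens per model parameter; the regime labels below are illustrative readings of the paper's findings, not the authors' own code or exact decision rule.

```python
def chinchilla_optimal_tokens(num_params: int) -> int:
    """Approximate Chinchilla-optimal token budget (~20 tokens per parameter)."""
    return 20 * num_params

def contamination_regime(num_params: int, training_tokens: int) -> str:
    """Hypothetical classification of how training scale relates to forgetting,
    loosely following the paper's observation that contamination is effectively
    forgotten beyond ~5x the Chinchilla-optimal budget."""
    ratio = training_tokens / chinchilla_optimal_tokens(num_params)
    if ratio >= 5:
        return "likely forgotten"      # beyond 5x Chinchilla
    elif ratio >= 1:
        return "partially forgotten"   # at or past Chinchilla-optimal
    return "likely overfit"            # under-trained relative to Chinchilla

# The paper's largest small-scale setting: 1.6B parameters, 40B tokens.
# 40B / (20 * 1.6B) = 1.25x Chinchilla, so contamination persists in part.
print(contamination_regime(1_600_000_000, 40_000_000_000))
```

For a modern frontier-scale run, e.g. a 405B-parameter model trained on many trillions of tokens, the ratio far exceeds 5x, consistent with the paper's conclusion that early contamination is forgotten.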
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Data Contamination
Memory Retention
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models
Data Forgetting
Weight Decay Impact