Quantifying the Effect of Test Set Contamination on Generative Evaluations

📅 2026-01-07
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study addresses the critical issue of test set contamination in frontier AI systems pretrained on web-scale data, which undermines the reliability of generative evaluations. By injecting varying numbers of copies of the MATH benchmark into the pretraining corpus, the authors quantitatively assess the impact of contamination on generative performance and examine the roles of model scale, further training, and inference settings. They demonstrate for the first time that even a single replica of the test set can reduce model loss below the irreducible error floor observed in the uncontaminated setting. Unlike discriminative tasks, whose answers are only a few tokens long, generative tasks resist memorization more strongly as answers grow longer. The experiments reveal that contamination substantially inflates generative performance, that high-temperature sampling mitigates memorization effects, and that overtraining on fresh data dampens contamination while the effect of supervised finetuning depends on the contamination level, offering crucial insights for trustworthy AI evaluation.
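The inference-time findings can be sketched concretely with a toy example (all logits and values below are hypothetical, not the paper's data): raising the sampling temperature flattens the next-token distribution, and since an L-token memorized answer must survive every sampling step, its reproduction probability decays roughly like p^L.

```python
import math

def softmax_with_temperature(logits, T):
    """Softmax over logits rescaled by temperature T (T > 0)."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits; index 0 plays the role of the memorized next token.
logits = [5.0, 2.0, 1.0, 0.5]
for T in (0.5, 1.0, 2.0):
    p = softmax_with_temperature(logits, T)[0]
    # Reproducing a 20-token memorized answer requires sampling it at
    # every step, roughly p**20, so longer answers are exponentially
    # harder to regurgitate than shorter ones.
    print(f"T={T}: p(token)={p:.3f}, p(20-token answer)~{p**20:.3g}")
```

As the loop shows, the per-token probability of the memorized continuation drops as T rises, and the per-answer probability drops much faster for long answers, consistent with both inference-time effects the study reports.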

๐Ÿ“ Abstract
As frontier AI systems are pretrained on web-scale data, test set contamination has become a critical concern for accurately assessing their capabilities. While research has thoroughly investigated the impact of test set contamination on discriminative evaluations like multiple-choice question-answering, comparatively little research has studied the impact of test set contamination on generative evaluations. In this work, we quantitatively assess the effect of test set contamination on generative evaluations through the language model lifecycle. We pretrain language models on mixtures of web data and the MATH benchmark, sweeping model sizes and number of test set replicas contaminating the pretraining corpus; performance improves with contamination and model size. Using scaling laws, we make a surprising discovery: including even a single test set replica enables models to achieve lower loss than the irreducible error of training on the uncontaminated corpus. We then study further training: overtraining with fresh data reduces the effects of contamination, whereas supervised finetuning on the training set can either increase or decrease performance on test data, depending on the amount of pretraining contamination. Finally, at inference, we identify factors that modulate memorization: high sampling temperatures mitigate contamination effects, and longer solutions are exponentially more difficult to memorize than shorter ones, presenting a contrast with discriminative evaluations, where solutions are only a few tokens in length. By characterizing how generation and memorization interact, we highlight a new layer of complexity for trustworthy evaluation of AI systems.
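The scaling-law discovery can be illustrated with a minimal sketch (every functional form and constant below is a synthetic assumption, not a fitted value from the paper): a saturating law L(N) = E + A·N^(-alpha) has an irreducible floor E that clean pretraining approaches but never crosses, so a measured test loss below E is evidence of memorization rather than capability.

```python
# Toy illustration of the irreducible-error argument. The law and all
# constants are illustrative assumptions, not the paper's fitted values.

def scaling_law(n_params, E=1.2, A=50.0, alpha=0.3):
    """Loss vs. model size: floor E plus a decaying power-law term."""
    return E + A * n_params ** (-alpha)

floor_E = 1.2  # hypothetical irreducible error of the clean corpus

# Clean-corpus loss approaches, but never crosses, the floor.
for n in (1e7, 1e8, 1e9, 1e10):
    assert scaling_law(n) > floor_E

# A contaminated model measured below the floor implies memorization.
contaminated_loss = 1.05  # hypothetical measurement on contaminated run
print("below irreducible floor:", contaminated_loss < floor_E)
```

The design choice mirrors the paper's logic: the floor E is a property of the data distribution, so only a model that has seen the test set can undercut it.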
Problem

Research questions and friction points this paper is trying to address.

test set contamination
generative evaluations
language models
memorization
AI evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

test set contamination
generative evaluation
scaling laws
memorization
language model lifecycle