LLM Benchmark Datasets Should Be Contamination-Resistant

📅 2026-05-19

🤖 AI Summary
Benchmark datasets for large language models are often contaminated, meaning they appear in pretraining corpora, which compromises their ability to faithfully assess model generalization. This work outlines the core properties of “contamination-resistant” benchmarks and argues that the asymmetry between the training and inference pipelines in the Transformer architecture can be leveraged to make data unlearnable during training yet still usable during inference. It also outlines the mathematical advances needed to make such datasets interoperable across LLM architectures. The authors highlight the wide prevalence of benchmark contamination and call on the research community to develop and adopt contamination-resistant benchmarks so that performance assessments become more credible.
📝 Abstract
Benchmark datasets are critical for reproducible, reliable, and discriminative evaluation of LLMs. However, recent studies reveal that many benchmark datasets are included in pretraining corpora, i.e., contaminated, which diminishes their value as reliable measures of model generalization. In this paper, we argue that benchmark datasets should be contamination-resistant, i.e., unlearnable, but support inference. To accomplish this, we first highlight the wide prevalence of benchmark dataset contamination and outline the properties of contamination-resistant datasets. Second, we highlight how the asymmetry between the inference and training pipelines in the Transformer architecture can be leveraged to support contamination-resistance. Third, we outline mathematical advancements to make these datasets interoperable across various LLM architectures. Based on the above, we call on the community to ensure the reliability of LLM benchmarking by: (i) advancing novel contamination-resistant methodologies, (ii) developing supporting methods and platforms, and (iii) adopting contamination-resistant benchmarks into existing evaluation pipelines.
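The abstract does not say how contamination is measured. For orientation only, one common heuristic is to check how many word-level n-grams of a benchmark item also appear in a pretraining document. The sketch below illustrates that heuristic; it is not the paper's method, and the function names, the whitespace tokenization, and the default n of 8 are all assumptions made for this example.

```python
def ngrams(text, n):
    """Return the set of word-level n-grams in `text` (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item, corpus_doc, n=8):
    """Fraction of the benchmark item's n-grams that also occur in the corpus document.

    Hypothetical helper: a high ratio suggests the item may have leaked into the corpus.
    Returns 0.0 when the item has fewer than n tokens.
    """
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(corpus_doc, n)) / len(item_grams)
```

A ratio near 1.0 means almost every n-gram of the item is present verbatim in the document. Exact matching of this kind misses paraphrased or reformatted copies, so a low ratio does not by itself rule out contamination.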
Problem

Research questions and friction points this paper is trying to address.

benchmark contamination
large language models
generalization evaluation
contamination-resistant datasets
reproducible evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

contamination-resistant
benchmark datasets
unlearnable
inference asymmetry
LLM evaluation