Towards Contamination Resistant Benchmarks

📅 2025-05-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language model (LLM) evaluation suffers from training data contamination, leading to inflated performance estimates and unreliable assessments of true reasoning capability. Method: We propose the first rigorously defined, contamination-resistant evaluation paradigm, introducing a cryptographic benchmark task based on the Caesar cipher—designed to be transparent, intractable via shortcut learning, and resistant to implicit memorization. Our methodology integrates controlled contamination experiments, cross-model zero- and few-shot evaluation protocols, and contamination-provenance analysis to strictly isolate and quantify training-data leakage effects. Contribution/Results: Under controlled contamination conditions, mainstream LLMs—including GPT-4—exhibit accuracy drops to below 30%, exposing substantial overestimation of their genuine reasoning abilities. This work addresses a fundamental limitation of conventional benchmarks—susceptibility to contamination—and establishes a trustworthy standard and methodological framework for robust, contamination-aware LLM evaluation.

📝 Abstract
The rapid development of large language models (LLMs) has transformed the landscape of natural language processing. Evaluating LLMs properly is crucial for understanding their potential and addressing concerns such as safety. However, LLM evaluation is confronted by various factors, among which contamination stands out as a key issue that undermines the reliability of evaluations. In this work, we introduce the concept of contamination resistance to address this challenge. We propose a benchmark based on Caesar ciphers (e.g., "ab" to "bc" when the shift is 1), which, despite its simplicity, is an excellent example of a contamination resistant benchmark. We test this benchmark on widely used LLMs under various settings, and we find that these models struggle with this benchmark when contamination is controlled. Our findings reveal issues in current LLMs and raise important questions regarding their true capabilities. Our work contributes to the development of contamination resistant benchmarks, enabling more rigorous LLM evaluation and offering insights into the true capabilities and limitations of LLMs.
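The Caesar cipher transformation the abstract describes ("ab" to "bc" at shift 1) can be sketched with a short function. This is an illustrative sketch of the cipher itself, not the paper's benchmark implementation; the function name and signature are assumptions:

```python
def caesar_shift(text: str, shift: int) -> str:
    """Shift each lowercase letter forward by `shift` positions,
    wrapping around from 'z' back to 'a'; other characters pass through."""
    out = []
    for ch in text:
        if "a" <= ch <= "z":
            out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        else:
            out.append(ch)  # leave spaces, punctuation, etc. unchanged
    return "".join(out)

print(caesar_shift("ab", 1))  # -> bc, the abstract's example
```

Because the mapping is fully determined by the shift parameter, a model that has genuinely learned the rule should decode any ciphertext, while a model relying on memorized examples will fail on unseen shifts or strings.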
Problem

Research questions and friction points this paper is trying to address.

Addressing contamination in LLM evaluation reliability
Proposing Caesar cipher benchmark for contamination resistance
Revealing LLM limitations under controlled contamination settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces contamination resistance concept for LLM evaluation
Proposes Caesar cipher-based benchmark for contamination control
Tests LLMs under controlled contamination settings
Rahmatullah Musawi
Sheng Lu
Nanjing Tech University