🤖 AI Summary
To address benchmark contamination in large language model (LLM) evaluation, which undermines the validity of assessments, this paper proposes the first verifiable watermarking mechanism for LLM evaluation benchmarks. Before a benchmark is released, a dedicated watermarking LLM rewrites its questions so that statistically detectable yet imperceptible textual watermarks are embedded while the original meaning is preserved. During evaluation, watermark traces ("radioactivity") left in a model trained on the benchmark are detected with a theoretically grounded statistical test. Experiments pre-training 1B-parameter models from scratch on 10B tokens with controlled contamination show that watermarking leaves benchmark utility essentially unchanged on ARC-Easy, ARC-Challenge, and MMLU, and that contamination strong enough to boost performance is reliably detected, e.g. a +5% gain on ARC-Easy flagged at significance level *p* = 10⁻³.
📝 Abstract
Benchmark contamination poses a significant challenge to the reliability of Large Language Model (LLM) evaluations, as it is difficult to assert whether a model has been trained on a test set. We introduce a solution to this problem by watermarking benchmarks before their release. The embedding involves reformulating the original questions with a watermarked LLM, in a way that does not alter the benchmark's utility. During evaluation, we can detect "radioactivity", i.e. traces that the text watermarks leave in the model during training, using a theoretically grounded statistical test. We test our method by pre-training 1B models from scratch on 10B tokens with controlled benchmark contamination, and validate its effectiveness in detecting contamination on ARC-Easy, ARC-Challenge, and MMLU. Results show similar benchmark utility post-watermarking and successful contamination detection when models are contaminated enough to enhance performance, e.g. $p$-value $= 10^{-3}$ for a +5% gain on ARC-Easy.
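To make the detection idea concrete, here is a minimal, hypothetical sketch of the kind of statistical test such radioactivity detection relies on. It is not the paper's method: it assumes a simple "green-list" text watermark in which each token is hashed with a secret key into a green set with probability `gamma`, so that under the null hypothesis (no contamination) the green-token count in a model's outputs is Binomial(n, gamma), and an excess of green tokens yields a small one-sided p-value (normal approximation). The function name, key, and hashing scheme are all illustrative assumptions.

```python
import hashlib
import math


def green_fraction_test(tokens, key="wm-key", gamma=0.5):
    """Hypothetical radioactivity test (NOT the paper's exact detector).

    Each token is hashed with a secret key into a "green list" with
    probability `gamma`. A model trained on watermarked text tends to
    over-produce green tokens; under H0 the green count is
    Binomial(n, gamma), and we report a one-sided p-value via a z-test.
    """
    greens = sum(
        int(hashlib.sha256((key + t).encode()).hexdigest(), 16) % 100 < gamma * 100
        for t in tokens
    )
    n = len(tokens)
    z = (greens - gamma * n) / math.sqrt(gamma * (1.0 - gamma) * n)
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))  # P(Z >= z) under H0
    return greens, z, p_value
```

On ordinary text the green fraction hovers near `gamma` and the p-value is unremarkable; on output that systematically favors green tokens the p-value collapses, which is the statistical signature of contamination that a threshold such as $10^{-3}$ would flag.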