🤖 AI Summary
To address benchmark contamination in large language model (LLM) evaluation, which undermines the validity of assessments, this paper proposes the first verifiable watermarking mechanism for LLM evaluation benchmarks. Before a benchmark is released, a dedicated watermarking LLM rewrites its questions so that statistically detectable yet imperceptible textual watermarks are embedded while the original meaning is preserved. During evaluation, watermark traces ("radioactivity") left in a model trained on the benchmark are detected with a theoretically grounded statistical test. Experiments pre-training 1B-parameter models from scratch on 10B tokens with controlled contamination show that watermarking leaves benchmark utility essentially unchanged on ARC-Easy, ARC-Challenge, and MMLU, and that contamination strong enough to boost performance is reliably detected, e.g. a +5% gain on ARC-Easy flagged at significance level *p* = 10⁻³.
📝 Abstract
Benchmark contamination poses a significant challenge to the reliability of Large Language Model (LLM) evaluations, as it is difficult to assert whether a model has been trained on a test set. We introduce a solution to this problem by watermarking benchmarks before their release. The embedding involves reformulating the original questions with a watermarked LLM, in a way that does not alter the benchmark's utility. During evaluation, we can detect "radioactivity", i.e. traces that the text watermarks leave in the model during training, using a theoretically grounded statistical test. We test our method by pre-training 1B models from scratch on 10B tokens with controlled benchmark contamination, and validate its effectiveness in detecting contamination on ARC-Easy, ARC-Challenge, and MMLU. Results show similar benchmark utility post-watermarking and successful contamination detection when models are contaminated enough to enhance performance, e.g. $p$-value $= 10^{-3}$ for a +5% gain on ARC-Easy.
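To make the detection idea concrete, here is a minimal, hypothetical sketch of the kind of statistical test such radioactivity detection relies on. It is not the paper's method: it assumes a simple "green-list" text watermark in which each token is hashed with a secret key into a green set with probability `gamma`, so that under the null hypothesis (no contamination) the green-token count in a model's outputs is Binomial(n, gamma), and an excess of green tokens yields a small one-sided p-value (normal approximation). The function name, key, and hashing scheme are all illustrative assumptions.

```python
import hashlib
import math


def green_fraction_test(tokens, key="wm-key", gamma=0.5):
    """Hypothetical radioactivity test (NOT the paper's exact detector).

    Each token is hashed with a secret key into a "green list" with
    probability `gamma`. A model trained on watermarked text tends to
    over-produce green tokens; under H0 the green count is
    Binomial(n, gamma), and we report a one-sided p-value via a z-test.
    """
    greens = sum(
        int(hashlib.sha256((key + t).encode()).hexdigest(), 16) % 100 < gamma * 100
        for t in tokens
    )
    n = len(tokens)
    z = (greens - gamma * n) / math.sqrt(gamma * (1.0 - gamma) * n)
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))  # P(Z >= z) under H0
    return greens, z, p_value
```

On ordinary text the green fraction hovers near `gamma` and the p-value is unremarkable; on output that systematically favors green tokens the p-value collapses, which is the statistical signature of contamination that a threshold such as $10^{-3}$ would flag.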