Provable Joint Decontamination for Benchmarking Multiple Large Language Models

📅 2026-05-20
📈 Citations: 0
Influential: 0

🤖 AI Summary
This study addresses training data contamination in large language model evaluation, which can inflate performance estimates and distort cross-model comparisons. The authors formalize multi-model benchmark decontamination as a joint selection problem and propose Joint Envelope Conformal Selection (JECS). JECS computes conformal p-values for each model, aggregates them by taking the per-item maximum across models, and reconstructs a conservative envelope of the max-p null distribution from right-tail observations above a data-driven threshold. Applying the adaptive Benjamini-Hochberg procedure to the envelope-rescaled values yields a benchmark with provable control of the global contamination rate (GCR) under stated assumptions. Experiments show that JECS achieves higher power than the max-p baseline while maintaining the target GCR.
📝 Abstract
Benchmark data contamination has become a central challenge in LLM evaluation: when evaluation examples appear in the training data of one or more audited models, reported performance can be inflated and cross-model comparisons become unreliable. A broad line of training-data detection work designs scores to quantify how strongly a model memorizes a given data point, but these score-based methods lack theoretical guarantees. Recent conformal approaches provide provable false-identification control for a single model; however, applying them separately to each model can produce model-specific benchmarks, undermining fair comparison across models. In this work, we formalize multi-model benchmark decontamination as a joint selection problem and propose Joint Envelope Conformal Selection (JECS), a conformal procedure that enables global contamination rate (GCR) control under stated assumptions. Specifically, JECS computes per-model conformal p-values, aggregates them by the per-item maximum, and reconstructs a conservative envelope of the max-p null distribution from right-tail observations above a data-driven threshold. By applying the adaptive Benjamini-Hochberg (BH) procedure to the envelope-rescaled values, we select a benchmark with provable GCR control. Extensive experiments across various models and benchmarks demonstrate that JECS achieves higher power than the max-p baseline while consistently maintaining the target GCR control.
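The max-p baseline mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation, and it omits the JECS envelope reconstruction and the adaptive BH step, which the abstract does not specify in enough detail to reproduce. The assumptions here are illustrative only: each model assigns every item a memorization score where larger means more memorized, a calibration set of known-contaminated items is available per model, and a small p-value is read as evidence that an item is clean for that model. The function names `conformal_pvalues`, `max_p`, and `benjamini_hochberg` are hypothetical.

```python
import numpy as np


def conformal_pvalues(cal_scores, test_scores):
    """Conformal p-values against a calibration set (assumed: known-contaminated items).

    p_j = (1 + #{calibration scores <= test score j}) / (n + 1).
    A small p-value means the test score is unusually low relative to
    contaminated items, which this sketch treats as evidence of cleanliness.
    """
    cal = np.sort(np.asarray(cal_scores, dtype=float))
    test = np.asarray(test_scores, dtype=float)
    # searchsorted with side="right" counts calibration scores <= each test score
    counts = np.searchsorted(cal, test, side="right")
    return (1.0 + counts) / (cal.size + 1.0)


def max_p(pvals_per_model):
    """Per-item maximum of p-values across models (rows = models, columns = items)."""
    return np.max(np.vstack(pvals_per_model), axis=0)


def benjamini_hochberg(pvals, alpha):
    """Plain (non-adaptive) BH step-up procedure; returns a boolean selection mask."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    passed = np.nonzero(p[order] <= thresholds)[0]
    selected = np.zeros(m, dtype=bool)
    if passed.size:
        # select every item up to the largest rank that meets its threshold
        selected[order[: passed[-1] + 1]] = True
    return selected
```

Under these assumptions, an item is kept in the shared benchmark only when its largest per-model p-value is small, so a single model with strong evidence of memorization is enough to exclude it. Because the per-item maximum tends to be conservative under the null, this baseline can lose power, which is the gap the abstract says JECS targets.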
Problem

Research questions and friction points this paper is trying to address.

benchmark contamination
large language models
fair model comparison
evaluation reliability
training data leakage
Innovation

Methods, ideas, or system contributions that make the work stand out.

conformal selection
benchmark decontamination
global contamination rate
large language models
joint model evaluation