Provable Joint Decontamination for Benchmarking Multiple Large Language Models

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses training data contamination in large language model evaluation, which can inflate performance estimates and distort cross-model comparisons. The authors formalize the decontamination of multi-model benchmarks as a joint selection task and propose JECS (Joint Envelope Conformal Selection). The method computes per-model p-values via conformal inference, aggregates them by taking the maximum across models for each benchmark item, reconstructs a conservative envelope of the max-p null distribution from right-tail observations above a data-driven threshold, and applies the adaptive Benjamini–Hochberg procedure to the envelope-rescaled values. Under stated assumptions, this yields provable control of the global contamination rate (GCR). Experiments show that JECS achieves higher power than the max-p baseline while maintaining control of the GCR at the target level.

📝 Abstract

Benchmark data contamination has become a central challenge in LLM evaluation: when evaluation examples appear in the training data of one or more audited models, reported performance can be inflated and cross-model comparisons become unreliable. A broad line of training-data detection work designs scores to quantify how strongly a model memorizes a given data point, but these score-based methods lack theoretical guarantees. Recent conformal approaches provide provable false-identification control for a single model; however, applying them separately to each model can produce model-specific benchmarks, undermining fair comparison across models. In this work, we formalize multi-model benchmark decontamination as a joint selection problem and propose Joint Envelope Conformal Selection (JECS), a conformal procedure that enables global contamination rate (GCR) control under stated assumptions. Specifically, JECS computes per-model conformal p-values, aggregates them by the per-item maximum, and reconstructs a conservative envelope of the max-p null distribution from right-tail observations above a data-driven threshold. By applying the adaptive Benjamini-Hochberg (BH) procedure to the envelope-rescaled values, we select a benchmark with provable GCR control. Extensive experiments across various models and benchmarks demonstrate that JECS achieves higher power than the max-p baseline while consistently controlling the GCR at the target level.
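
The pipeline named in the abstract (per-model conformal p-values, per-item maximum, BH selection) can be illustrated with a minimal sketch of what is presumably close to the max-p baseline. This is not the authors' JECS implementation: the envelope reconstruction and the adaptive variant of BH are omitted because the abstract does not specify them, and plain BH is used instead. The function names, the use of a calibration set of known-contaminated items, and the convention that a lower memorization score is evidence of a clean item are all assumptions made for illustration.

```python
import numpy as np


def conformal_p_values(calib_scores, test_scores):
    """Conformal p-values for one model (illustrative convention).

    Assumption: calib_scores are memorization scores of items known to be
    contaminated for this model, and a LOW test score is evidence that the
    item is clean. p_j = (1 + #{i : calib_i <= test_j}) / (n + 1).
    """
    calib = np.sort(np.asarray(calib_scores, dtype=float))
    test = np.asarray(test_scores, dtype=float)
    # side="right" counts calibration scores that are <= each test score
    counts = np.searchsorted(calib, test, side="right")
    return (1.0 + counts) / (calib.size + 1.0)


def max_p_aggregate(p_matrix):
    """Per-item maximum over models; p_matrix has shape (n_models, n_items)."""
    return np.max(np.asarray(p_matrix, dtype=float), axis=0)


def benjamini_hochberg(p_values, alpha):
    """Plain (non-adaptive) BH: boolean mask of selected items."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    passed = np.nonzero(p[order] <= thresholds)[0]
    selected = np.zeros(m, dtype=bool)
    if passed.size > 0:
        # select every item up to the largest rank that meets its threshold
        selected[order[: passed[-1] + 1]] = True
    return selected
```

Under these assumptions, one would stack the outputs of `conformal_p_values` for each audited model into a matrix, call `max_p_aggregate` on it, and pass the result to `benjamini_hochberg` with the target level. Taking the maximum means an item is selected only when every model's p-value is small, which is why a single shared benchmark results rather than one benchmark per model.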

Problem

Research questions and friction points this paper is trying to address.

benchmark contamination

large language models

fair model comparison

evaluation reliability

training data leakage

Innovation

Methods, ideas, or system contributions that make the work stand out.

conformal selection

benchmark decontamination

global contamination rate

large language models

joint model evaluation

🔎 Similar Papers

A Comprehensive Survey of Contamination Detection Methods in Large Language Models

2024-03-31 | Citations: 6

PaCoST: Paired Confidence Significance Testing for Benchmark Contamination Detection in Large Language Models

2024-06-26 | Conference on Empirical Methods in Natural Language Processing | Citations: 0