Training on the Benchmark Is Not All You Need

📅 2024-09-03
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
To address the critical issue of evaluation distortion caused by potential overlap between large language model (LLM) pre-training data and benchmark test sets, this paper proposes a gray-box data leakage detection method. It requires no access to model weights or training data; instead, it exploits the semantic invariance of multiple-choice questions under option permutation: answer options are randomly shuffled to generate derived samples, and an outlying maximum in the model's log-probability distribution over those samples indicates leakage. The method establishes a lightweight, general, gray-box-compatible detection paradigm that covers both intentionally and unintentionally shuffled options. After validating the approach on two LLMs with controlled benchmark designs, the authors evaluate 35 mainstream open-source LLMs on four major benchmarks, quantify leakage severity, rank the models, and find that the Qwen series exhibits the most pronounced leakage.

📝 Abstract
The success of Large Language Models (LLMs) relies heavily on the huge amounts of data learned during the pre-training phase. Because both the pre-training process and the training data are opaque, the results of many benchmark tests have become unreliable: if a model has been trained on a benchmark's test set, that benchmark seriously hinders the health of the field. To test the capabilities of LLMs automatically and efficiently, numerous mainstream benchmarks adopt a multiple-choice format. Since swapping the contents of multiple-choice options does not change the meaning of the question itself, we propose a simple and effective data leakage detection method based on this property. Specifically, we shuffle the contents of the options to generate corresponding derived datasets, and then detect leakage from the model's log-probability distribution over those derived datasets: if the set of log probabilities contains an outlying maximum, the data has been leaked. Our method works under gray-box conditions, without access to model training data or weights, and effectively identifies leakage of benchmark test sets into pre-training data, covering both normal scenarios and complex scenarios in which options may have been shuffled intentionally or unintentionally. Through experiments on two LLMs and purpose-built benchmark designs, we demonstrate the effectiveness of our method. In addition, we evaluate the degree of data leakage of 35 mainstream open-source LLMs on four benchmark datasets, give a ranking of the leaked LLMs for each benchmark, and find that the Qwen family of LLMs has the highest degree of data leakage.
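The detection procedure described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: `score_fn` is a hypothetical black-box that returns the model's total log-probability for a prompt, and the z-score outlier test (and its threshold) is an assumed stand-in for the paper's unspecified outlier criterion.

```python
import itertools
import statistics

def detect_leakage(question, options, score_fn, z_threshold=3.0):
    """Shuffled-options leakage test (sketch).

    Builds one prompt per permutation of the answer options, scores each
    with the model's log-probability, and flags leakage when the best
    permutation is an outlying maximum relative to the others.
    """
    scores = []
    for perm in itertools.permutations(options):
        prompt = question + "\n" + "\n".join(
            f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(perm)
        )
        scores.append(score_fn(prompt))

    top = max(scores)
    rest = [s for s in scores if s != top] or scores
    mu = statistics.mean(rest)
    sigma = statistics.pstdev(rest) or 1e-9  # avoid division by zero
    # Flag leakage when the top permutation stands out as a statistical
    # outlier among all option orderings of the same question.
    return (top - mu) / sigma > z_threshold
```

A genuinely unseen question should score roughly the same under every option ordering, so no permutation stands out; a memorized question makes the ordering seen in pre-training an outlying maximum.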
Problem

Research questions and friction points this paper is trying to address.

Overlap between benchmark test sets and LLM pre-training data makes evaluation results unreliable.
The pre-training process and data are opaque, so leakage cannot be checked directly against training corpora or model weights.
Leakage may be masked when multiple-choice options have been shuffled, intentionally or unintentionally.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Shuffling multiple-choice options yields semantically equivalent derived datasets, enabling a simple leakage test.
An outlying maximum in the model's log-probability distribution over the derived datasets identifies leaked data.
The gray-box method works without access to model weights or training data.