LessLeak-Bench: A First Investigation of Data Leakage in LLMs Across 83 Software Engineering Benchmarks

📅 2025-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses evaluation distortion in software engineering (SE) tasks caused by pre-training data leakage in large language models (LLMs). We conduct the first large-scale quantitative analysis of data leakage across 83 SE benchmarks. Our method combines text-similarity metrics (BLEU, CodeBERTScore) with subset detection, augmented by pre-training data provenance tracing and platform-level source attribution. Results reveal substantial inter-benchmark variation in leakage, from 100% on QuixBugs down to an average of just 0.7% on C/C++ benchmarks, and demonstrate that leakage artificially inflates model performance. Based on these findings, we introduce LessLeak-Bench: the first de-leaked, multi-language SE evaluation benchmark. This work fills a fundamental gap in systematic leakage assessment for LLMs in SE, providing both a rigorous methodology and open-source infrastructure for trustworthy LLM evaluation.

📝 Abstract
Large Language Models (LLMs) are widely utilized in software engineering (SE) tasks, such as code generation and automated program repair. However, their reliance on extensive and often undisclosed pre-training datasets raises significant concerns about data leakage, where the evaluation benchmark data is unintentionally "seen" by LLMs during the model's construction phase. The data leakage issue could largely undermine the validity of LLM-based research and evaluations. Despite the increasing use of LLMs in the SE community, there is no comprehensive study that assesses the extent of data leakage in SE benchmarks for LLMs yet. To address this gap, this paper presents the first large-scale analysis of data leakage in 83 SE benchmarks concerning LLMs. Our results show that in general, data leakage in SE benchmarks is minimal, with average leakage ratios of only 4.8%, 2.8%, and 0.7% for Python, Java, and C/C++ benchmarks, respectively. However, some benchmarks exhibit relatively higher leakage ratios, which raises concerns about their bias in evaluation. For instance, QuixBugs and BigCloneBench have leakage ratios of 100.0% and 55.7%, respectively. Furthermore, we observe that data leakage has a substantial impact on LLM evaluation. We also identify key causes of high data leakage, such as the direct inclusion of benchmark data in pre-training datasets and the use of coding platforms like LeetCode for benchmark construction. To address data leakage, we introduce LessLeak-Bench, a new benchmark that removes leaked samples from the 83 SE benchmarks, enabling more reliable LLM evaluations in future research. Our study enhances the understanding of data leakage in SE benchmarks and provides valuable insights for future research involving LLMs in SE.
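The abstract describes flagging leaked benchmark samples with text-similarity metrics such as BLEU. As a rough illustration only (not the paper's actual pipeline), the sketch below scores a benchmark sample against pre-training documents with a self-contained BLEU-style n-gram precision; the `leakage_score`/`is_leaked` names and the 0.75 threshold are illustrative assumptions, and real corpora would need indexing rather than a linear scan.

```python
import math
from collections import Counter

def _ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def leakage_score(sample, document, max_n=4):
    """BLEU-style similarity of a benchmark sample to one pre-training
    document: 1.0 means verbatim n-gram overlap, 0.0 means none."""
    cand, ref = sample.split(), document.split()
    n_max = min(max_n, len(cand), len(ref))
    if n_max == 0:
        return 0.0
    precisions = []
    for n in range(1, n_max + 1):
        c = Counter(_ngrams(cand, n))
        r = Counter(_ngrams(ref, n))
        overlap = sum(min(cnt, r[g]) for g, cnt in c.items())
        precisions.append(overlap / sum(c.values()))
    if min(precisions) == 0.0:
        return 0.0  # this sketch uses no smoothing
    # Geometric mean of n-gram precisions times a brevity penalty.
    gm = math.exp(sum(math.log(p) for p in precisions) / n_max)
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * gm

def is_leaked(sample, corpus, threshold=0.75):
    """Flag a sample if any pre-training document exceeds the threshold."""
    return any(leakage_score(sample, doc) >= threshold for doc in corpus)
```

An exact copy of a sample in the corpus scores 1.0 and is flagged, while a document sharing no n-grams scores 0.0; the paper additionally uses CodeBERTScore and subset detection, which this token-level sketch does not capture.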
Problem

Research questions and friction points this paper is trying to address.

How much pre-training data leakage exists in the SE benchmarks used to evaluate LLMs?
What causes the high leakage observed in benchmarks such as QuixBugs and BigCloneBench?
How can leaked samples be removed to enable reliable LLM evaluation?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces LessLeak-Bench benchmark
Analyzes data leakage in 83 SE benchmarks
Identifies causes of high data leakage
👥 Authors
Xin Zhou
Singapore Management University, Singapore
Martin Weyssow
Research Scientist, Singapore Management University
Deep Learning for Code, Large Language Models, AI4SE
Ratnadira Widyasari
Singapore Management University
Computer Science
Ting Zhang
Singapore Management University, Singapore
Junda He
Singapore Management University
Software Engineering
Yunbo Lyu
PhD Candidate, Singapore Management University
Software Engineering
Jianming Chang
Southeast University, China
Beiqi Zhang
Wuhan University
Software Engineering, SE4AI, AI4SE
Dan Huang
Singapore Management University, Singapore
David Lo
Singapore Management University, Singapore