Are Large Language Models Memorizing Bug Benchmarks?

📅 2024-11-20

🏛️ arXiv.org

📈 Citations: 3

✨ Influential: 0

🤖 AI Summary

This work systematically assesses the risk of training data leakage for large language models (LLMs) on widely used bug benchmarks such as Defects4J, a threat to the validity of evaluations that rely on them. It combines several signals of leakage: benchmark membership in commonly used training datasets, negative log-likelihood, and n-gram accuracy. The findings differ markedly across models: codegen-multi shows significant evidence of memorization, whereas newer models trained on larger datasets, such as LLaMa 3.1, show limited signs of leakage. The results point to the need for careful benchmark selection and robust metrics when assessing LLMs' code-related capabilities.

📝 Abstract

Large Language Models (LLMs) have become integral to various software engineering tasks, including code generation, bug detection, and repair. To evaluate model performance in these domains, numerous bug benchmarks containing real-world bugs from software projects have been developed. However, a growing concern within the software engineering community is that these benchmarks may not reliably reflect true LLM performance due to the risk of data leakage. Despite this concern, limited research has been conducted to quantify the impact of potential leakage. In this paper, we systematically evaluate popular LLMs to assess their susceptibility to data leakage from widely used bug benchmarks. To identify potential leakage, we use multiple metrics, including a study of benchmark membership within commonly used training datasets, as well as analyses of negative log-likelihood and n-gram accuracy. Our findings show that certain models, in particular codegen-multi, exhibit significant evidence of memorization in widely used benchmarks like Defects4J, while newer models trained on larger datasets, like LLaMa 3.1, exhibit limited signs of leakage. These results highlight the need for careful benchmark selection and the adoption of robust metrics to adequately assess models' capabilities.
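To make the n-gram accuracy metric concrete, here is a minimal sketch of one common formulation, which is an assumption here and not necessarily the paper's exact procedure: pick several start positions in a tokenized benchmark file, give the model the prefix up to each position, and count how often it reproduces the next n tokens exactly. The function name `ngram_accuracy`, the callback `predict_next`, and the parameter `num_starts` are hypothetical names introduced for illustration.

```python
from typing import Callable, List, Sequence


def ngram_accuracy(
    tokens: Sequence[str],
    predict_next: Callable[[List[str], int], List[str]],
    n: int = 5,
    num_starts: int = 5,
) -> float:
    """Fraction of sampled positions where the model reproduces the next n tokens.

    `predict_next(prefix, n)` is a hypothetical stand-in for a model call that
    returns the n tokens it would generate after `prefix`.
    """
    if len(tokens) <= n:
        raise ValueError("sequence must be longer than n tokens")

    # Evenly spaced start positions; each leaves at least one prefix token
    # and a full n-gram of ground truth after it.
    last_start = len(tokens) - n
    starts = sorted(
        {1 + (i * (last_start - 1)) // max(num_starts - 1, 1) for i in range(num_starts)}
    )

    hits = 0
    for start in starts:
        prefix = list(tokens[:start])
        target = list(tokens[start:start + n])
        if list(predict_next(prefix, n)) == target:
            hits += 1
    return hits / len(starts)
```

An accuracy close to 1.0 on benchmark files means the model can continue them verbatim, which is the kind of signal the abstract treats as evidence of memorization; the score is most informative when compared against code the model is unlikely to have seen.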

Problem

Research questions and friction points this paper is trying to address.

Assessing LLM susceptibility to data leakage from bug benchmarks

Quantifying memorization impact on model performance evaluation

Identifying benchmark contamination in popular LLM training datasets
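A simple way to picture the membership question raised above is a set lookup: given the repositories a benchmark draws its bugs from and a list of repositories included in a training dataset, report which ones overlap. This is only a sketch under assumptions; the source does not describe how the membership study was carried out, and the function name `membership_rate` and all repository names below are hypothetical placeholders.

```python
from typing import Iterable, List, Tuple


def membership_rate(
    benchmark_repos: Iterable[str],
    dataset_repos: Iterable[str],
) -> Tuple[float, List[str]]:
    """Return the share of benchmark repositories found in a dataset's repo list.

    Names are compared case-insensitively after trimming whitespace.
    """
    benchmark = [repo.strip().lower() for repo in benchmark_repos]
    if not benchmark:
        raise ValueError("benchmark_repos must not be empty")

    dataset = {repo.strip().lower() for repo in dataset_repos}
    found = [repo for repo in benchmark if repo in dataset]
    return len(found) / len(benchmark), found


# Hypothetical repository names, for illustration only.
rate, overlapping = membership_rate(
    ["example-org/project-a", "example-org/project-b"],
    ["Example-Org/Project-A", "other-org/unrelated"],
)
```

Membership in a dataset only shows that a model could have seen the benchmark; it does not by itself show memorization, which is why it is paired with model-side metrics.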

Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematically evaluate LLMs for data leakage

Use multiple metrics to identify memorization

Compare memorization across models, from codegen-multi to newer models trained on larger datasets such as LLaMa 3.1
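Of the metrics named above, negative log-likelihood (NLL) is the easiest to illustrate. The sketch below assumes the per-token log-probabilities have already been obtained from a model; the values used are made up for illustration and are not results from the paper, and the helper names `mean_nll` and `perplexity` are hypothetical.

```python
import math
from typing import Sequence


def mean_nll(token_logprobs: Sequence[float]) -> float:
    """Average negative log-likelihood over a sequence of token log-probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return -sum(token_logprobs) / len(token_logprobs)


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity is the exponential of the mean NLL."""
    return math.exp(mean_nll(token_logprobs))


# Made-up log-probabilities: a model that assigns high probability to every
# token of a benchmark file versus lower probability on comparable unseen code.
benchmark_logprobs = [math.log(0.9)] * 20
unseen_logprobs = [math.log(0.3)] * 20

# A positive gap means the benchmark text is "less surprising" to the model.
nll_gap = mean_nll(unseen_logprobs) - mean_nll(benchmark_logprobs)
```

A markedly lower NLL on benchmark code than on comparable code outside the benchmark is consistent with the model having seen the benchmark during training, though it is a signal to weigh alongside the other metrics, not a proof.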
