Cracks in The Stack: Hidden Vulnerabilities and Licensing Risks in LLM Pre-Training Datasets

πŸ“… 2025-01-05
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Pretraining datasets for large language models (LLMs) contain latent security vulnerabilities and license compliance risks due to outdated or uncurated source code. Method: This paper proposes a source-code self-purification framework leveraging the complete version histories of open-source projects. It exploits evolutionary trajectories via code-block provenance matching, CVE association, license compliance verification, and automated deduplication labeling to precisely identify training samples containing patched vulnerabilities or non-compliant licenses. Contribution/Results: Evaluated on Stack v2, the method reveals that 17% of code files have newer versions; of those, 17% represent bug fixes and 2.36% address known CVEs. The deduplicated dataset still contains blobs vulnerable to 6,947 known CVEs, and 58% of blobs were never modified after creation, suggesting long-unused β€œzombie code”; misattributed blob origins also admit non-permissively licensed code, posing serious compliance hazards. This work is the first to systematically uncover version-lag-induced security and legal risks in LLM pretraining data and delivers a scalable, automated purification framework.

πŸ“ Abstract
A critical part of creating code suggestion systems is the pre-training of Large Language Models on vast amounts of source code and natural language text, often of questionable origin or quality. This may contribute to the presence of bugs and vulnerabilities in code generated by LLMs. While efforts to identify bugs at or after code generation exist, it is preferable to pre-train or fine-tune LLMs on curated, high-quality, and compliant datasets. The need for vast amounts of training data necessitates that such curation be automated, minimizing human intervention. We propose an automated source code auto-curation technique that leverages the complete version history of open-source software projects to improve the quality of training data. This approach uses the version history of all OSS projects to identify training data samples that have been modified or have undergone changes in at least one OSS project, and to pinpoint a subset of samples that include fixes for bugs or vulnerabilities. We evaluate this method using The Stack v2 dataset, and find that 17% of the code files in the dataset have newer versions, with 17% of those representing bug fixes, including 2.36% addressing known CVEs. The deduplicated version of Stack v2 still includes blobs vulnerable to 6,947 known CVEs. Furthermore, 58% of the blobs in the dataset were never modified after creation, suggesting they likely represent software with minimal or no use. Misidentified blob origins present an additional challenge, as they lead to the inclusion of non-permissively licensed code, raising serious compliance concerns. By addressing these issues, the training of new models can avoid perpetuating buggy code patterns or license violations. We expect our results to inspire process improvements for automated data curation, with the potential to enhance the reliability of outputs generated by AI tools.
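The curation idea in the abstract can be sketched in a few lines: for each training blob, consult the project's version history, separate blobs that were never touched again ("zombie" code) from those with newer versions, and flag the subset whose successors look like bug or CVE fixes. The sketch below is a minimal illustration, not the paper's implementation; the `BlobRecord` structure, the keyword heuristic, and all names are assumptions introduced here.

```python
# Hedged sketch of version-history-based training-data triage.
# BlobRecord, FIX_KEYWORDS, and classify() are illustrative inventions,
# not APIs from the paper or from The Stack v2 tooling.
from dataclasses import dataclass, field

@dataclass
class BlobRecord:
    blob_id: str                 # content hash of a training sample
    created: int                 # commit timestamp when the blob first appeared
    # (timestamp, commit message) pairs for later versions of the same file
    successors: list = field(default_factory=list)

FIX_KEYWORDS = ("fix", "bug", "vulnerability", "patch")

def classify(blob: BlobRecord) -> str:
    """Classify a blob by its evolutionary trajectory across OSS history."""
    if not blob.successors:
        return "zombie"          # never modified after creation
    msgs = [msg.lower() for _, msg in blob.successors]
    if any("cve-" in m for m in msgs):
        return "cve_fixed"       # a later version addresses a known CVE
    if any(k in m for m in msgs for k in FIX_KEYWORDS):
        return "bug_fixed"       # a later version appears to be a bug fix
    return "updated"             # newer version exists, not fix-related

blobs = [
    BlobRecord("a1", 100),
    BlobRecord("b2", 100, [(200, "Fix CVE-2021-1234 buffer overflow")]),
    BlobRecord("c3", 100, [(150, "refactor module layout")]),
]
print([classify(b) for b in blobs])  # → ['zombie', 'cve_fixed', 'updated']
```

In practice the "successor" relation would come from provenance matching across full repository histories rather than per-file commit logs, and CVE association would use vulnerability databases instead of commit-message keywords; the sketch only shows the triage step.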
Problem

Research questions and friction points this paper is trying to address.

Code Quality
Legal Issues
Safety Risks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Model Training
Code Quality and Security Enhancement
Open Source Version History Utilization