We Have a Package for You! A Comprehensive Analysis of Package Hallucinations by Code Generating LLMs

📅 2024-06-12
🏛️ arXiv.org
📈 Citations: 5
Influential: 1
🤖 AI Summary
This work identifies and systematically investigates “package hallucination”—a novel software supply chain security threat wherein code large language models (Code LLMs) generate non-existent or incompatible package names. Through automated evaluation of 576,000 Python and JavaScript samples across 16 state-of-the-art models, we empirically reveal alarmingly high hallucination rates: 5.2% for commercial models and 21.7% for open-source models, yielding over 205,000 unique hallucinated package names. We propose a root-cause analysis framework identifying three primary drivers: training data bias, contextual confusion, and tokenization artifacts. To mitigate the issue, we design a multi-strategy intervention—integrating data curation, context-aware prompting, and post-generation validation—that significantly reduces hallucination while preserving functional correctness and executability of generated code. Our study provides both theoretical insights and practical guidelines for building trustworthy, production-ready code generation systems.
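One of the mitigation strategies named above, post-generation validation, can be sketched as checking every package imported by generated code against a snapshot of the registry's known names. This is a minimal illustration, not the paper's implementation; `KNOWN_PACKAGES` is a hypothetical stand-in for a real PyPI index dump.

```python
import ast

# Hypothetical snapshot of registry package names; in practice this would be
# loaded from a full PyPI (or npm) index dump.
KNOWN_PACKAGES = {"requests", "numpy", "flask"}

def extract_imports(source: str) -> set[str]:
    """Return top-level module names imported by the generated code."""
    tree = ast.parse(source)
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.add(node.module.split(".")[0])
    return names

def hallucinated_packages(source: str, known: set[str] = KNOWN_PACKAGES) -> set[str]:
    """Flag imported names that do not appear in the registry snapshot."""
    return extract_imports(source) - known

sample = "import requests\nimport totally_fake_pkg\n"
print(sorted(hallucinated_packages(sample)))  # ['totally_fake_pkg']
```

A production validator would also need to map import names to distribution names (e.g. `sklearn` vs. `scikit-learn`) before rejecting or regenerating code.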

📝 Abstract
The reliance of popular programming languages such as Python and JavaScript on centralized package repositories and open-source software, combined with the emergence of code-generating Large Language Models (LLMs), has created a new type of threat to the software supply chain: package hallucinations. These hallucinations, which arise from fact-conflicting errors when generating code using LLMs, represent a novel form of package confusion attack that poses a critical threat to the integrity of the software supply chain. This paper conducts a rigorous and comprehensive evaluation of package hallucinations across different programming languages, settings, and parameters, exploring how a diverse set of models and configurations affect the likelihood of generating erroneous package recommendations and identifying the root causes of this phenomenon. Using 16 popular LLMs for code generation and two unique prompt datasets, we generate 576,000 code samples in two programming languages that we analyze for package hallucinations. Our findings reveal that the average percentage of hallucinated packages is at least 5.2% for commercial models and 21.7% for open-source models, including a staggering 205,474 unique examples of hallucinated package names, further underscoring the severity and pervasiveness of this threat. To overcome this problem, we implement several hallucination mitigation strategies and show that they are able to significantly reduce the number of package hallucinations while maintaining code quality. Our experiments and findings highlight package hallucinations as a persistent and systemic phenomenon when using state-of-the-art LLMs for code generation, and a significant challenge that deserves the research community's urgent attention.
Problem

Research questions and friction points this paper is trying to address.

Analyzes package hallucinations in code-generating LLMs
Identifies root causes of erroneous package recommendations
Proposes strategies to mitigate package hallucinations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates and analyzes 576,000 code samples using two unique prompt datasets
Evaluates 16 LLMs across diverse programming languages
Implements strategies to mitigate package hallucinations