🤖 AI Summary
This paper systematically investigates the security risks of package hallucination, the generation of references to non-existent software packages that malicious actors can register and weaponize, by large language models (LLMs) during code synthesis. To address the multi-language setting, we construct a cross-language benchmark and propose a custom hallucination metric, then conduct the first quantitative analysis linking hallucination rates to model scale, programming language, task specificity, and HumanEval performance. Empirical results reveal a sparsely populated Pareto frontier between security and generation capability, motivating hallucination rate as a new safety heuristic. We further identify high-risk hallucination patterns, establish a reproducible evaluation framework, and design targeted mitigation strategies, including constrained decoding and package-aware verification, that reduce hallucination without compromising functionality. Our work provides both theoretical foundations and practical guidelines for securing AI-assisted software development.
📝 Abstract
Large Language Models (LLMs) have become an essential tool in the programmer's toolkit, but their tendency to hallucinate code can be exploited by malicious actors to introduce vulnerabilities across broad swathes of the software supply chain. In this work, we analyze package hallucination behavior in LLMs across popular programming languages, examining both references to existing packages and fictional dependencies. This analysis surfaces potential attacks, and we propose defensive strategies against them. We discover that package hallucination rate depends not only on model choice, but also on programming language, model size, and the specificity of the coding task request. The Pareto optimality boundary between code generation performance and package hallucination is sparsely populated, suggesting that coding models are not being optimized for secure code. Additionally, we find an inverse correlation between package hallucination rate and HumanEval coding benchmark scores, offering a heuristic for evaluating a model's propensity to hallucinate packages. Our metrics, findings, and analyses provide a foundation for future models, securing AI-assisted software development workflows against package supply chain attacks.
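The package-aware verification mentioned above can be sketched as a post-generation check: extract the packages that a model's generated code imports and flag any that are absent from a known package index. The snippet below is a minimal illustration of this idea, not the paper's actual pipeline; it assumes a toy allowlist standing in for a real registry query, and the package name `totally_real_utils` is invented for the example.

```python
import ast

# Toy allowlist standing in for a real package index (e.g., PyPI metadata).
KNOWN_PACKAGES = {"numpy", "requests", "pandas", "flask"}


def extract_top_level_imports(source: str) -> set:
    """Collect the top-level package names imported by generated code."""
    tree = ast.parse(source)
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                names.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.add(node.module.split(".")[0])
    return names


def flag_hallucinated_packages(source: str, index=KNOWN_PACKAGES) -> set:
    """Return imported package names absent from the index: candidate
    hallucinations to verify before any install step runs."""
    return extract_top_level_imports(source) - index


generated = "import numpy\nfrom totally_real_utils import helper\n"
print(flag_hallucinated_packages(generated))  # prints {'totally_real_utils'}
```

In a real deployment, the allowlist lookup would be replaced by a query against the target ecosystem's registry, and flagged names would be blocked or surfaced to the developer rather than silently installed.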