🤖 AI Summary
This paper systematically investigates the security risks of package hallucination, the generation of references to non-existent software packages that malicious actors can register and weaponize, by large language models (LLMs) during code synthesis. To address the multi-language setting, we construct a cross-language benchmark and propose a custom hallucination metric, then conduct the first quantitative analysis linking hallucination rates to model scale, programming language, task specificity, and HumanEval performance. Empirical results reveal a sparsely populated Pareto frontier between security and generation capability, motivating hallucination rate as a new safety heuristic. We further identify high-risk hallucination patterns, establish a reproducible evaluation framework, and design targeted mitigation strategies, including constrained decoding and package-aware verification, that reduce hallucination without compromising functionality. Our work provides both theoretical foundations and practical guidelines for securing AI-assisted software development.
📝 Abstract
Large Language Models (LLMs) have become an essential tool in the programmer's toolkit, but their tendency to hallucinate code can be exploited by malicious actors to introduce vulnerabilities across broad swathes of the software supply chain. In this work, we analyze package hallucination behavior in LLMs across popular programming languages, examining both references to existing packages and fictional dependencies. This analysis surfaces potential attacks, and we propose defensive strategies against them. We discover that package hallucination rate depends not only on model choice, but also on programming language, model size, and the specificity of the coding task request. The Pareto optimality boundary between code generation performance and package hallucination is sparsely populated, suggesting that coding models are not being optimized for secure code. Additionally, we find an inverse correlation between package hallucination rate and HumanEval coding benchmark scores, offering a heuristic for evaluating a model's propensity to hallucinate packages. Our metrics, findings, and analyses provide a foundation for future models, securing AI-assisted software development workflows against package supply chain attacks.
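The package-aware verification mentioned above can be sketched as a post-generation check: extract the packages that a model's generated code imports and flag any that are absent from a known package index. The snippet below is a minimal illustration of this idea, not the paper's actual pipeline; it assumes a toy allowlist standing in for a real registry query, and the package name `totally_real_utils` is invented for the example.

```python
import ast

# Toy allowlist standing in for a real package index (e.g., PyPI metadata).
KNOWN_PACKAGES = {"numpy", "requests", "pandas", "flask"}


def extract_top_level_imports(source: str) -> set:
    """Collect the top-level package names imported by generated code."""
    tree = ast.parse(source)
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                names.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.add(node.module.split(".")[0])
    return names


def flag_hallucinated_packages(source: str, index=KNOWN_PACKAGES) -> set:
    """Return imported package names absent from the index: candidate
    hallucinations to verify before any install step runs."""
    return extract_top_level_imports(source) - index


generated = "import numpy\nfrom totally_real_utils import helper\n"
print(flag_hallucinated_packages(generated))  # prints {'totally_real_utils'}
```

In a real deployment, the allowlist lookup would be replaced by a query against the target ecosystem's registry, and flagged names would be blocked or surfaced to the developer rather than silently installed.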