Harmless Yet Harmful: Neutral Prompting Attacks for Stealthy Hallucination Steering in Agent Skills

📅 2026-05-28

🤖 AI Summary

This work addresses a critical software supply chain vulnerability arising from package name hallucinations in large language models (LLMs) during code generation, which attackers can exploit by registering these fictitious names. The authors propose Neutral Prompting Attack (NPA), a novel method that implicitly induces LLMs to generate more hallucinated package names through semantically benign instructions—such as encouraging imagination or exhaustive enumeration—without explicitly specifying malicious identifiers. NPA is the first to demonstrate that seemingly harmless prompts can covertly manipulate model hallucination behavior, thereby evading existing defenses. Extensive experiments across multiple code-generation LLMs and package hallucination benchmarks show that NPA significantly increases both the rate of package hallucinations and successful pip installations, alters the distribution of hallucinated packages, and effectively bypasses state-of-the-art mitigation strategies, revealing a previously underappreciated attack vector in software supply chain security.

📝 Abstract

LLM-powered coding agents increasingly participate in software development workflows by generating code, selecting dependencies, and producing package installation commands. This creates a new software supply chain risk: when an agent hallucinates a non-existent package, an attacker may register the hallucinated name and later compromise users who install it. Existing package hallucination attacks and defenses primarily focus on naturally occurring hallucinations, targeted dependency steering, or post-hoc package validation. In this paper, we introduce Neutral Prompting Attack (NPA), a highly stealthy attack paradigm in which semantically benign instructions, such as encouraging imagination and exhaustiveness, increase package hallucination propensity without containing explicit malicious intent. Unlike targeted dependency steering, NPA does not specify an attacker-chosen package. Instead, it shifts the model's dependency generation behavior toward more speculative package names. We evaluate NPA across multiple coding-oriented LLMs and package hallucination benchmarks. Our results show that NPA increases both Hallucination ASR and Pip Install ASR, changes the distribution of hallucinated package names, and evades existing static-analysis, LLM-based, and agent-based Skill defenses. These findings reveal that harmless-looking prompts can covertly manipulate hallucination behavior and create downstream software supply chain risks.
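To make the risk concrete, here is a minimal sketch of the kind of post-hoc package validation the abstract mentions: it pulls package names out of `pip install` lines in generated text and flags any name missing from a known set. Everything here is an illustrative assumption rather than the paper's method: `KNOWN_PACKAGES` is a hypothetical stand-in for a registry snapshot, the helper names are invented, and `hallucination_rate` is a simple per-response fraction, not the paper's definition of Hallucination ASR or Pip Install ASR.

```python
import re

# Hypothetical stand-in for a registry snapshot (assumption, not real index data).
KNOWN_PACKAGES = {"requests", "numpy", "pandas", "flask"}

# Matches lines such as "pip install foo bar==1.0" or "pip3 install foo".
PIP_INSTALL = re.compile(r"^\s*pip3?\s+install\s+(.+)$", re.MULTILINE)


def extract_packages(generated_text):
    """Collect package names from pip install lines.

    Tokens starting with "-" are skipped and version pins or extras are
    stripped. Arguments that follow a flag (for example a requirements file
    after -r) are not handled in this sketch.
    """
    names = []
    for match in PIP_INSTALL.finditer(generated_text):
        for token in match.group(1).split():
            if token.startswith("-"):
                continue
            name = re.split(r"[=<>!~\[;]", token, maxsplit=1)[0].strip().lower()
            if name:
                names.append(name)
    return names


def unknown_packages(generated_text, known=KNOWN_PACKAGES):
    """Return the extracted names that are absent from the known set."""
    return [n for n in extract_packages(generated_text) if n not in known]


def hallucination_rate(responses, known=KNOWN_PACKAGES):
    """Fraction of responses naming at least one unknown package (illustrative metric)."""
    if not responses:
        return 0.0
    flagged = sum(1 for r in responses if unknown_packages(r, known))
    return flagged / len(responses)
```

Comparing `hallucination_rate` on responses produced with and without a neutral-sounding instruction would give a rough view of the shift the abstract describes. A name-existence check of this kind only helps while the hallucinated name is still unregistered; once an attacker registers it, the name passes the check.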

Problem

Research questions and friction points this paper is trying to address.

package hallucination

software supply chain

neutral prompting

LLM agent

stealthy attack

Innovation

Methods, ideas, or system contributions that make the work stand out.

Neutral Prompting Attack

Package Hallucination

Software Supply Chain Risk