Backdoor-Powered Prompt Injection Attacks Nullify Defense Methods

📅 2025-10-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper identifies a novel threat—Backdoor-Driven Prompt Injection Attacks (BDPIA)—that evades existing instruction-level defenses. Unlike conventional prompt injection, BDPIA bypasses assumptions underlying current mitigation strategies by embedding malicious behavior via model poisoning rather than input manipulation. Method: BDPIA integrates backdoor learning with prompt injection: it implants stealthy backdoors through supervised fine-tuning data poisoning, designs semantically plausible triggers, and validates effectiveness across diverse prompting scenarios. Contribution/Results: This is the first systematic integration of backdoor attacks into the prompt injection paradigm, fundamentally challenging the security assumptions of instruction-tuned LLMs. Experiments demonstrate that BDPIA achieves near 100% attack success rates on multiple open-source LLMs while remaining fully undetected by representative instruction-finetuning–based defenses. The attack operates transparently during normal model inference—executing adversarial commands only upon encountering trigger conditions—thereby exhibiting superior stealth and reliability compared to traditional prompt injection methods.

Technology Category

Application Category

📝 Abstract
With the development of technology, large language models (LLMs) have dominated the downstream natural language processing (NLP) tasks. However, because of the LLMs' instruction-following abilities and inability to distinguish the instructions in the data content, such as web pages from search engines, the LLMs are vulnerable to prompt injection attacks. These attacks trick the LLMs into deviating from the original input instruction and executing the attackers' target instruction. Recently, various instruction hierarchy defense strategies are proposed to effectively defend against prompt injection attacks via fine-tuning. In this paper, we explore more vicious attacks that nullify the prompt injection defense methods, even the instruction hierarchy: backdoor-powered prompt injection attacks, where the attackers utilize the backdoor attack for prompt injection attack purposes. Specifically, the attackers poison the supervised fine-tuning samples and insert the backdoor into the model. Once the trigger is activated, the backdoored model executes the injected instruction surrounded by the trigger. We construct a benchmark for comprehensive evaluation. Our experiments demonstrate that backdoor-powered prompt injection attacks are more harmful than previous prompt injection attacks, nullifying existing prompt injection defense methods, even the instruction hierarchy techniques.
Problem

Research questions and friction points this paper is trying to address.

Backdoor attacks bypass LLM prompt injection defenses via poisoned training data
Trigger-activated backdoors override original instructions with malicious commands
These attacks nullify existing defense methods including instruction hierarchy techniques
Innovation

Methods, ideas, or system contributions that make the work stand out.

Poisoned fine-tuning samples insert backdoors
Trigger activation executes injected malicious instructions
Backdoor attacks bypass existing prompt injection defenses
🔎 Similar Papers
No similar papers found.