PromptArmor: Simple yet Effective Prompt Injection Defenses

📅 2025-07-20

📈 Citations: 0

✨ Influential: 0

career value

191K/year

🤖 AI Summary

Large language model (LLM) agents are vulnerable to prompt injection attacks, causing task deviation from user intent. To address this, we propose PromptArmor—a lightweight, LLM-based prompt sanitization framework applied pre-execution, requiring no fine-tuning or additional training. PromptArmor leverages off-the-shelf models—including GPT-4o, GPT-4.1, and o4-mini—combined with customized prompting strategies to achieve high-precision detection and removal of malicious content. Evaluated on the AgentDojo benchmark, it achieves both false positive and false negative rates below 1%, reduces attack success rate to under 1%, and maintains robustness against adaptive attacks. With high detection accuracy and minimal deployment overhead, PromptArmor establishes a new standard baseline for prompt injection defense. Moreover, it provides a reproducible and extensible evaluation paradigm for future research.

Technology Category

Application Category

📝 Abstract

Despite their potential, recent research has demonstrated that LLM agents are vulnerable to prompt injection attacks, where malicious prompts are injected into the agent's input, causing it to perform an attacker-specified task rather than the intended task provided by the user. In this paper, we present PromptArmor, a simple yet effective defense against prompt injection attacks. Specifically, PromptArmor prompts an off-the-shelf LLM to detect and remove potential injected prompts from the input before the agent processes it. Our results show that PromptArmor can accurately identify and remove injected prompts. For example, using GPT-4o, GPT-4.1, or o4-mini, PromptArmor achieves both a false positive rate and a false negative rate below 1% on the AgentDojo benchmark. Moreover, after removing injected prompts with PromptArmor, the attack success rate drops to below 1%. We also demonstrate PromptArmor's effectiveness against adaptive attacks and explore different strategies for prompting an LLM. We recommend that PromptArmor be adopted as a standard baseline for evaluating new defenses against prompt injection attacks.

Problem

Research questions and friction points this paper is trying to address.

Defends against LLM prompt injection attacks

Detects and removes malicious injected prompts

Reduces attack success rate below 1%

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses off-the-shelf LLM to detect injections

Removes malicious prompts before processing

Achieves under 1% false rates on benchmarks

🔎 Similar Papers

No similar papers found.