Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models

📅 2023-12-21
🏛️ arXiv.org
📈 Citations: 41
Influential: 3
📄 PDF
🤖 AI Summary
Indirect Prompt Injection Attacks (IPIAs) pose a critical threat to large language models (LLMs), wherein adversaries implicitly embed malicious instructions in external text to trigger unintended model behavior. This work establishes BIPIA—the first systematic benchmark for IPIA—revealing, for the first time, that the root cause lies in LLMs’ inability to distinguish informative context from executable instructions. Building on this insight, we propose two novel defense mechanisms: (i) a boundary-aware filter for black-box settings, and (ii) an explicit instruction prompting mechanism for white-box settings. Extensive evaluation on BIPIA demonstrates that mainstream LLMs exhibit severe vulnerabilities to IPIAs; our defenses reduce attack success rates substantially and suppress them to near-zero, respectively, without degrading original task performance. This work provides the first standardized benchmark, a foundational mechanistic understanding of IPIAs, and practical, effective defense paradigms—advancing both empirical evaluation and robustness research in LLM security.
📝 Abstract
The integration of large language models with external content has enabled applications such as Microsoft Copilot but also introduced vulnerabilities to indirect prompt injection attacks. In these attacks, malicious instructions embedded within external content can manipulate LLM outputs, causing deviations from user expectations. To address this critical yet under-explored issue, we introduce the first benchmark for indirect prompt injection attacks, named BIPIA, to assess the risk of such vulnerabilities. Using BIPIA, we evaluate existing LLMs and find them universally vulnerable. Our analysis identifies two key factors contributing to their success: LLMs' inability to distinguish between informational context and actionable instructions, and their lack of awareness in avoiding the execution of instructions within external content. Based on these findings, we propose two novel defense mechanisms-boundary awareness and explicit reminder-to address these vulnerabilities in both black-box and white-box settings. Extensive experiments demonstrate that our black-box defense provides substantial mitigation, while our white-box defense reduces the attack success rate to near-zero levels, all while preserving the output quality of LLMs. We hope this work inspires further research into securing LLM applications and fostering their safe and reliable use.
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Indirect Malicious Command Attack
Security Vulnerability
Innovation

Methods, ideas, or system contributions that make the work stand out.

BIPIA Testing Standard
Boundary Awareness
Explicit Prompting
🔎 Similar Papers
No similar papers found.
Jingwei Yi
Jingwei Yi
University of Science and Technology of China
LLM SafetyFederated Learning
Yueqi Xie
Yueqi Xie
Princeton University
AI and SocietyResponsible AISocial ComputingComputational Social Science
B
Bin Zhu
Microsoft Corporation, Beijing, China
K
Keegan Hines
E
Emre Kiciman
Microsoft Corporation, Seattle, USA
G
Guangzhong Sun
University of Science and Technology of China, Heifei, China
X
Xing Xie
Microsoft Corporation, Beijing, China
Fangzhao Wu
Fangzhao Wu
Microsoft
Responsible AI