Overcoming the Retrieval Barrier: Indirect Prompt Injection in the Wild for LLM Systems

📅 2026-01-11

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

career value

186K/year

🤖 AI Summary

Indirect prompt injection (IPI) poses limited real-world threats because malicious content is rarely retrieved by natural user queries. This work proposes a "trigger–attack fragment decoupling" strategy that splits malicious payloads into a high-retrievability trigger fragment and a separate attack fragment encoding the adversarial objective, enabling the first end-to-end black-box IPI attack under natural queries. By leveraging embedding model APIs to construct compact trigger fragments, the method is compatible with both open-source and commercial embedding models, achieving attack costs as low as \$0.21 per attempt. It attains near-100% retrieval success across 11 benchmarks and 8 embedding models. Notably, a single poisoned email suffices to induce GPT-4o to leak SSH keys in multi-agent systems with over 80% success, exposing the retrieval pipeline as a critical security vulnerability.

Technology Category

Application Category

📝 Abstract

Large language models (LLMs) increasingly rely on retrieving information from external corpora. This creates a new attack surface: indirect prompt injection (IPI), where hidden instructions are planted in the corpora and hijack model behavior once retrieved. Previous studies have highlighted this risk but often avoid the hardest step: ensuring that malicious content is actually retrieved. In practice, unoptimized IPI is rarely retrieved under natural queries, which leaves its real-world impact unclear. We address this challenge by decomposing the malicious content into a trigger fragment that guarantees retrieval and an attack fragment that encodes arbitrary attack objectives. Based on this idea, we design an efficient and effective black-box attack algorithm that constructs a compact trigger fragment to guarantee retrieval for any attack fragment. Our attack requires only API access to embedding models, is cost-efficient (as little as $0.21 per target user query on OpenAI's embedding models), and achieves near-100% retrieval across 11 benchmarks and 8 embedding models (including both open-source models and proprietary services). Based on this attack, we present the first end-to-end IPI exploits under natural queries and realistic external corpora, spanning both RAG and agentic systems with diverse attack objectives. These results establish IPI as a practical and severe threat: when a user issued a natural query to summarize emails on frequently asked topics, a single poisoned email was sufficient to coerce GPT-4o into exfiltrating SSH keys with over 80% success in a multi-agent workflow. We further evaluate several defenses and find that they are insufficient to prevent the retrieval of malicious text, highlighting retrieval as a critical open vulnerability.

Problem

Research questions and friction points this paper is trying to address.

indirect prompt injection

retrieval barrier

large language models

external corpora

security vulnerability

Innovation

Methods, ideas, or system contributions that make the work stand out.

indirect prompt injection

retrieval attack

black-box optimization