Phantom: General Trigger Attacks on Retrieval Augmented Language Generation

📅 2024-05-30
🏛️ arXiv.org
📈 Citations: 57
Influential: 14
📄 PDF
🤖 AI Summary
This work exposes a novel backdoor poisoning threat in Retrieval-Augmented Generation (RAG) systems: an attacker can compromise system integrity by injecting merely one malicious document into the knowledge base, causing the LLM to generate harmful outputs—such as refusal responses or privacy leakage—whenever user queries contain a specific trigger token. We propose the first two-stage optimization framework that decouples trigger-based retrieval activation from harmful output manipulation. Our approach integrates gradient-guided embedding optimization, adversarial text generation, and cross-model transferability design, while explicitly modeling the joint vulnerability of RAG’s retrieval and generation components. Evaluated on open-weight models (Gemma, Vicuna, Llama), attack success rates exceed 92%; the attack further transfers successfully to closed-source production models (GPT-3.5 Turbo and GPT-4). Notably, we demonstrate, for the first time, practical efficacy against a black-box, production-grade RAG system—NVIDIA Chat with RTX.

Technology Category

Application Category

📝 Abstract
Retrieval Augmented Generation (RAG) expands the capabilities of modern large language models (LLMs), by anchoring, adapting, and personalizing their responses to the most relevant knowledge sources. It is particularly useful in chatbot applications, allowing developers to customize LLM output without expensive retraining. Despite their significant utility in various applications, RAG systems present new security risks. In this work, we propose new attack vectors that allow an adversary to inject a single malicious document into a RAG system's knowledge base, and mount a backdoor poisoning attack. We design Phantom, a general two-stage optimization framework against RAG systems, that crafts a malicious poisoned document leading to an integrity violation in the model's output. First, the document is constructed to be retrieved only when a specific trigger sequence of tokens appears in the victim's queries. Second, the document is further optimized with crafted adversarial text that induces various adversarial objectives on the LLM output, including refusal to answer, reputation damage, privacy violations, and harmful behaviors. We demonstrate our attacks on multiple LLM architectures, including Gemma, Vicuna, and Llama, and show that they transfer to GPT-3.5 Turbo and GPT-4. Finally, we successfully conducted a Phantom attack on NVIDIA's black-box production RAG system,"Chat with RTX".
Problem

Research questions and friction points this paper is trying to address.

Proposes backdoor attacks on Retrieval Augmented Generation systems via poisoned documents
Crafts malicious documents triggered by specific queries to violate output integrity
Induces adversarial objectives like refusal, reputation damage, and harmful behaviors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage optimization framework for backdoor attacks
Trigger-based document retrieval manipulation technique
Adversarial text generation for multiple attack objectives