🤖 AI Summary
This work identifies a novel data extraction threat targeting knowledge bases in retrieval-augmented generation (RAG) systems. Specifically, adversaries inject a small fraction (5%) of poisoned samples during large language model (LLM) fine-tuning to implant backdoors that trigger on designated instruction tokens—causing the model to deliberately disclose either verbatim excerpts or semantically paraphrased versions of retrieved documents. To our knowledge, this is the first study to introduce backdoor attacks into the RAG paradigm, overcoming the limitations of conventional prompt-injection attacks, which fail against strongly instruction-following models. Evaluated on Gemma-2B-IT, the attack achieves 94.1% exact document extraction success (ROUGE-L = 82.1) and 63.6% paraphrased extraction success (average ROUGE = 66.4). These results underscore a critical privacy vulnerability in the RAG supply chain—particularly at the model fine-tuning stage.
📝 Abstract
Despite significant advancements, large language models (LLMs) still struggle with providing accurate answers when lacking domain-specific or up-to-date knowledge. Retrieval-Augmented Generation (RAG) addresses this limitation by incorporating external knowledge bases, but it also introduces new attack surfaces. In this paper, we investigate data extraction attacks targeting RAG's knowledge databases. We show that previous prompt injection-based extraction attacks largely rely on the instruction-following capabilities of LLMs. As a result, they fail on models that are less responsive to such malicious prompts -- for example, our experiments show that state-of-the-art attacks achieve near-zero success on Gemma-2B-IT. Moreover, even for models that can follow these instructions, we found fine-tuning may significantly reduce attack performance. To further reveal the vulnerability, we propose to backdoor RAG, where a small portion of poisoned data is injected during the fine-tuning phase to create a backdoor within the LLM. When this compromised LLM is integrated into a RAG system, attackers can exploit specific triggers in prompts to manipulate the LLM to leak documents from the retrieval database. By carefully designing the poisoned data, we achieve both verbatim and paraphrased document extraction. For example, on Gemma-2B-IT, we show that with only 5% poisoned data, our method achieves an average success rate of 94.1% for verbatim extraction (ROUGE-L score: 82.1) and 63.6% for paraphrased extraction (average ROUGE score: 66.4) across four datasets. These results underscore the privacy risks associated with the supply chain when deploying RAG systems.