RevPRAG: Revealing Poisoning Attacks in Retrieval-Augmented Generation through LLM Activation Analysis

📅 2024-11-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Detecting knowledge-base poisoning attacks in retrieval-augmented generation (RAG) systems remains challenging, particularly when malicious texts are injected into public knowledge bases (e.g., Wikipedia) to induce large language models (LLMs) to generate attacker-specified "poisoned responses." Method: This paper proposes a detection method that leverages the internal activations of a frozen LLM. The authors observe, for the first time, statistically significant and separable differences in activation distributions across multiple attention and feed-forward network (FFN) layers when the model generates normal versus poisoned responses. The method requires no fine-tuning, query labels, or human annotations, only analysis of the frozen LLM's internal activations. Contribution/Results: Evaluated across diverse RAG architectures and benchmark datasets, the approach achieves a 98% true positive rate at a false positive rate of roughly 1%, substantially outperforming existing detection methods.
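The core idea, as summarized above, is that activations produced while generating poisoned responses are statistically separable from those of correct responses, so a lightweight probe can classify them. The sketch below illustrates this with simulated activation vectors (Gaussian clusters with a small mean shift) and a logistic-regression probe trained by gradient descent; the dimensionality, shift, and probe choice are illustrative assumptions, not the paper's exact setup, where activations would be collected from attention/FFN layers of the actual frozen LLM.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 64  # hypothetical activation dimensionality (assumption)

def simulate_activations(n, shift):
    """Stand-in for hidden-state vectors collected while the LLM answers.
    In RevPRAG these would be real activations from a frozen model."""
    return rng.normal(loc=shift, scale=1.0, size=(n, DIM))

# Label 0 = correct response, 1 = poisoned response (shifted activation pattern).
X_train = np.vstack([simulate_activations(200, 0.0), simulate_activations(200, 0.5)])
y_train = np.array([0] * 200 + [1] * 200)

# Logistic-regression probe fit by plain gradient descent on the log loss.
w = np.zeros(DIM)
b = 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X_train @ w + b)))   # predicted probabilities
    grad_w = X_train.T @ (p - y_train) / len(y_train)
    grad_b = float(np.mean(p - y_train))
    w -= 0.5 * grad_w
    b -= 0.5 * grad_b

# Held-out evaluation: true/false positive rates of the probe.
X_test = np.vstack([simulate_activations(500, 0.0), simulate_activations(500, 0.5)])
y_test = np.array([0] * 500 + [1] * 500)
pred = (1.0 / (1.0 + np.exp(-(X_test @ w + b)))) > 0.5
tpr = np.mean(pred[y_test == 1])
fpr = np.mean(pred[y_test == 0])
print(f"TPR={tpr:.2f}  FPR={fpr:.2f}")
```

Because the two activation clusters are well separated, the probe reaches a high TPR at a low FPR on held-out data, mirroring the qualitative claim of the paper without reproducing its exact numbers.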

📝 Abstract
Retrieval-Augmented Generation (RAG) enriches the input to LLMs by retrieving information from a relevant knowledge database, enabling them to produce responses that are more accurate and contextually appropriate. It is worth noting that the knowledge database, being sourced from publicly available channels such as Wikipedia, inevitably introduces a new attack surface. RAG poisoning involves injecting malicious texts into the knowledge database, ultimately leading to the generation of the attacker's target response (also called the poisoned response). However, there are currently limited methods available for detecting such poisoning attacks. We aim to bridge this gap in this work. In particular, we introduce RevPRAG, a flexible and automated detection pipeline that leverages the activations of LLMs for poisoned-response detection. Our investigation uncovers distinct patterns in LLMs' activations when generating correct responses versus poisoned responses. Our results on multiple benchmark datasets and RAG architectures show that our approach achieves a 98% true positive rate while keeping the false positive rate close to 1%.
Problem

Research questions and friction points this paper is trying to address.

Detect poisoning attacks in RAG systems
Identify malicious texts in knowledge databases
Analyze LLM activations for poisoned responses
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM activation analysis
Automated detection pipeline
Poisoned-response patterns
Xue Tan
School of Computer Science, Fudan University, Shanghai, China
Hao Luan
National University of Singapore
generative modeling, decision-making, autonomous systems, robotics, multi-agent systems
Mingyu Luo
School of Computer Science, Fudan University, Shanghai, China
Xiaoyan Sun
Microsoft Research Asia
Image/Video Coding, Multimedia Processing, Computer Vision
Ping Chen
Institute of Big Data, Fudan University, Shanghai, China
Jun Dai
Department of Computer Science, Worcester Polytechnic Institute, MA, USA