Safeguarding RAG Pipelines with GMTP: A Gradient-based Masked Token Probability Method for Poisoned Document Detection

📅 2025-07-24
🤖 AI Summary
This work addresses security risks in retrieval-augmented generation (RAG) systems arising from malicious document injection attacks. We propose a gradient-based mechanism for detecting poisoned documents that, for the first time, integrates gradient analysis with masked language modeling (MLM). Specifically, we compute gradients of the retriever’s similarity function with respect to input tokens to identify high-impact tokens, then use anomalously low MLM prediction probabilities for those tokens, together with the token importance ranking, to identify and filter poisoned documents at a fine-grained level. Experiments demonstrate that our method removes over 90% of injected poisoned content across diverse attack scenarios while preserving over 95% recall of benign documents, significantly improving RAG robustness and output reliability. The core contribution lies in introducing interpretable gradient analysis into RAG security defense, enabling real-time, high-accuracy interception of poisoned documents without model retraining and with minimal computational overhead.

📝 Abstract
Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by providing external knowledge for accurate and up-to-date responses. However, this reliance on external sources exposes a security risk: attackers can inject poisoned documents into the knowledge base to steer the generation process toward harmful or misleading outputs. In this paper, we propose Gradient-based Masked Token Probability (GMTP), a novel defense method to detect and filter out adversarially crafted documents. Specifically, GMTP identifies high-impact tokens by examining gradients of the retriever's similarity function. These key tokens are then masked, and their probabilities are checked via a Masked Language Model (MLM). Since injected tokens typically exhibit markedly low masked-token probabilities, GMTP can readily detect malicious documents and achieve high-precision filtering. Experiments demonstrate that GMTP eliminates over 90% of poisoned content while retaining relevant documents, thus maintaining robust retrieval and generation performance across diverse datasets and adversarial settings.
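The three steps described in the abstract (rank tokens by the gradient of the similarity function, mask the top-ranked tokens, check their masked-token probabilities) can be illustrated with a minimal, self-contained sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the 2-d embeddings in `EMB`, the attention-pooled toy similarity, the fixed lookup table `MLM_PROB` standing in for a real masked language model, the choice of `top_n=2`, the plain averaging of probabilities, and the `threshold=0.05` cutoff. A real pipeline would obtain the gradients from the actual retriever via automatic differentiation and the probabilities from an actual MLM.

```python
import math

# Hypothetical 2-d token embeddings, invented for illustration only.
EMB = {
    "the": [0.1, 0.0], "capital": [0.9, 0.2], "of": [0.0, 0.1],
    "france": [0.8, 0.4], "is": [0.1, 0.1], "paris": [0.7, 0.5],
    "xq": [2.0, 1.5], "zzv": [1.8, 1.7],  # stand-ins for injected tokens
}
# Hypothetical masked-token probabilities. A real MLM would condition on the
# surrounding context; this toy table ignores context entirely.
MLM_PROB = {
    "the": 0.30, "capital": 0.20, "of": 0.35, "france": 0.30,
    "is": 0.40, "paris": 0.25, "xq": 0.001, "zzv": 0.002,
}
QUERY_VEC = [1.0, 0.5]  # hypothetical query embedding


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gradient_norms(query_vec, tokens):
    """L2 norm of d(similarity)/d(token embedding) for each token.

    Toy similarity: s = sum_i a_i * z_i, with z_i = q . e_i and a = softmax(z).
    For this toy function the gradient has the closed form
    ds/de_i = a_i * (1 + z_i - s) * q.
    """
    z = [dot(query_vec, EMB[t]) for t in tokens]
    z_max = max(z)
    exps = [math.exp(v - z_max) for v in z]  # numerically stable softmax
    total = sum(exps)
    a = [e / total for e in exps]
    s = sum(ai * zi for ai, zi in zip(a, z))
    q_norm = math.sqrt(dot(query_vec, query_vec))
    return [abs(ai * (1.0 + zi - s)) * q_norm for ai, zi in zip(a, z)]


def key_tokens(query_vec, tokens, top_n=2):
    """Positions of the top_n tokens with the largest gradient norm."""
    norms = gradient_norms(query_vec, tokens)
    ranked = sorted(range(len(tokens)), key=lambda i: norms[i], reverse=True)
    return ranked[:top_n]


def masked_token_score(tokens, positions):
    """Average toy MLM probability of the tokens at the given positions."""
    probs = [MLM_PROB[tokens[i]] for i in positions]
    return sum(probs) / len(probs)


def is_poisoned(query_vec, tokens, top_n=2, threshold=0.05):
    """Flag a document when its key tokens look improbable to the toy MLM."""
    positions = key_tokens(query_vec, tokens, top_n)
    return masked_token_score(tokens, positions) < threshold


poisoned_doc = ["the", "capital", "of", "france", "is", "xq", "zzv"]
clean_doc = ["the", "capital", "of", "france", "is", "paris"]
print(is_poisoned(QUERY_VEC, poisoned_doc))  # True
print(is_poisoned(QUERY_VEC, clean_doc))     # False
```

In this toy setup the injected tokens dominate the gradient ranking because they are the ones most aligned with the query, and they are also the ones the stand-in MLM finds least plausible, which is the combination the abstract relies on.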
Problem

Research questions and friction points this paper is trying to address.

Detect poisoned documents in RAG pipelines
Filter adversarially crafted documents using GMTP
Maintain retrieval accuracy while removing malicious content
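The trade-off in the list above, removing poisoned documents while keeping benign ones, can be tracked with two simple rates. The sketch below is only one hedged way to measure it: the function name `filtering_report` and the two metric definitions are assumptions for illustration, and the paper's own evaluation metrics may differ.

```python
def filtering_report(is_poisoned_truth, was_flagged):
    """Return (removal_rate, retention_rate) for a detector's decisions.

    is_poisoned_truth[i]: True if document i is actually poisoned.
    was_flagged[i]: True if the detector filtered document i out.
    """
    poisoned_flags = [f for t, f in zip(is_poisoned_truth, was_flagged) if t]
    clean_flags = [f for t, f in zip(is_poisoned_truth, was_flagged) if not t]
    if not poisoned_flags or not clean_flags:
        raise ValueError("need at least one poisoned and one clean document")
    # Share of poisoned documents that were filtered out.
    removal_rate = sum(poisoned_flags) / len(poisoned_flags)
    # Share of clean documents that survived filtering.
    retention_rate = 1 - sum(clean_flags) / len(clean_flags)
    return removal_rate, retention_rate


truth = [True, True, False, False, False, False]
flags = [True, True, False, False, True, False]
print(filtering_report(truth, flags))  # (1.0, 0.75)
```

A detector that flags everything reaches a removal rate of 1.0 but a retention rate of 0.0, so both numbers need to be reported together.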
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gradient-based high-impact token identification
Masked token probability analysis via MLM
High-precision poisoned document filtering
Authors
San Kim
Graduate School of Artificial Intelligence, POSTECH, Republic of Korea
Jonghwi Kim
Graduate School of Artificial Intelligence, POSTECH, Republic of Korea
Yejin Jeon
POSTECH
Speech Synthesis, Signal Processing, Natural Language Processing
Gary Geunbae Lee
Graduate School of Artificial Intelligence, POSTECH, Republic of Korea; Department of Computer Science and Engineering, POSTECH, Republic of Korea