CleanBase: Detecting Malicious Documents in RAG Knowledge Databases

📅 2026-05-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the vulnerability of Retrieval-Augmented Generation (RAG) systems to prompt injection attacks, in which an adversary inserts malicious documents carrying injected prompts into the knowledge database so that the system generates attacker-specified answers to targeted questions. To counter this threat, the authors propose CleanBase, a detection method built on the observation that malicious documents crafted for the same targeted question tend to be highly similar to one another. CleanBase constructs a similarity graph over the knowledge database, connecting two documents when their embedding-based semantic similarity exceeds a statistically determined threshold, then detects cliques in the graph and flags the corresponding documents as malicious. The authors derive theoretical upper bounds on both the false positive and false negative rates. Experiments across multiple datasets and prompt injection attacks show that CleanBase accurately detects malicious documents.

📝 Abstract

Retrieval-augmented generation (RAG) is vulnerable to prompt injection attacks, in which an adversary inserts malicious documents containing carefully crafted injected prompts into the knowledge database. When a user issues a question targeted by the attack, the RAG system may retrieve these malicious documents, whose injected prompts mislead it into generating attacker-specified answers, thereby compromising the integrity of the RAG system. In this work, we propose CleanBase, a method to detect malicious documents within a knowledge database. Our key insight is that malicious documents crafted for the same attack-targeted questions often exhibit high semantic similarity, as attackers deliberately make them consistent to improve attack success rates. Accordingly, CleanBase constructs a similarity graph over the knowledge database, where each node represents a document and an edge connects two nodes if their semantic similarity, computed using an embedding model, exceeds a statistically determined threshold. Due to their inherent similarity, malicious documents tend to form cliques within this graph. CleanBase detects such cliques and flags the corresponding documents as malicious. We theoretically derive upper bounds on CleanBase's false positive and false negative rates and empirically validate its effectiveness. Experimental results across multiple datasets and prompt injection attacks demonstrate that CleanBase accurately detects malicious documents and effectively safeguards RAG systems. Our source code is available at https://github.com/WeifeiJin/CleanBase.
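
The graph-and-clique idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation (see their repository for that). It makes several assumptions: the embeddings are hand-written toy vectors rather than the output of an embedding model, the threshold is a fixed value (`threshold=0.95`) rather than the statistically determined one, and `min_clique_size` is a hypothetical parameter for deciding when a clique is large enough to flag. All function names are illustrative.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_graph(embeddings, threshold):
    """Adjacency sets: documents i and j are linked when similarity > threshold."""
    adj = {i: set() for i in range(len(embeddings))}
    for i, j in combinations(range(len(embeddings)), 2):
        if cosine(embeddings[i], embeddings[j]) > threshold:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)  # r cannot be extended, so it is maximal
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def flag_suspicious(embeddings, threshold=0.95, min_clique_size=3):
    """Return sorted indices of documents in any clique of at least min_clique_size.

    Both default values are assumptions made for this sketch.
    """
    adj = build_graph(embeddings, threshold)
    flagged = set()
    for clique in maximal_cliques(adj):
        if len(clique) >= min_clique_size:
            flagged |= clique
    return sorted(flagged)


# Toy data: documents 2-4 are near-duplicates, standing in for malicious
# documents crafted for the same targeted question; 0 and 1 are unrelated.
toy_embeddings = [
    [1.0, 0.1, 0.0],
    [0.0, 1.0, 0.2],
    [0.5, 0.5, 1.0],
    [0.52, 0.48, 1.0],
    [0.5, 0.52, 0.98],
]
print(flag_suspicious(toy_embeddings))  # [2, 3, 4]
```

Exhaustive pairwise comparison and clique enumeration are only practical for small toy inputs like this one. The sketch is meant to show the structure of the approach, not how it would scale to a real knowledge database.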

Problem

Research questions and friction points this paper is trying to address.

prompt injection

RAG

malicious documents

knowledge database

security

Innovation

Methods, ideas, or system contributions that make the work stand out.

RAG security

prompt injection

semantic similarity graph