RAGRank: Using PageRank to Counter Poisoning in CTI LLM Pipelines

📅 2025-10-23

🤖 AI Summary

Retrieval-augmented generation (RAG) systems for cyber threat intelligence (CTI) are vulnerable to data poisoning attacks. Two properties of the domain weaken conventional defenses: information about emerging threats is often semantically novel, and adversaries can closely mimic legitimate formatting and terminology. To address this, we propose a robustness-enhancing method grounded in source credibility ranking. We adapt a PageRank-style algorithm to model CTI document authority over a graph representation that quantifies source trustworthiness, which helps discriminate between poisoned content and authentic intelligence. Integrated into the RAG retrieval front end, our approach is evaluated on MS MARCO and on real-world CTI data streams: malicious documents exhibit a 37.2% average reduction in authority scores, while top-5 recall for trustworthy intelligence improves by 21.8%, indicating resilience against format-mimicking poisoning attacks.
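To make the authority-scoring idea concrete, the following is a minimal sketch of PageRank computed by power iteration over a small citation graph. It is not the paper's implementation: the graph, the node names (`vendor_advisory`, `cert_bulletin`, `analyst_blog`, `poisoned_doc`), and the damping factor of 0.85 are illustrative assumptions. The point it shows is that a document nobody trusted links to receives only the baseline score, however convincing its formatting is.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank over a dict {source: [targets]}.

    Assumes each target list has no duplicates. Nodes without outgoing
    links spread their rank uniformly over all nodes.
    """
    nodes = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank mass held by nodes with no outgoing links.
        dangling = sum(rank[v] for v in nodes if not links.get(v))
        new = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for src, targets in links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for t in targets:
                    new[t] += share
        delta = sum(abs(new[v] - rank[v]) for v in nodes)
        rank = new
        if delta < tol:
            break
    return rank


# Hypothetical citation graph: an edge A -> B means "A references B".
links = {
    "vendor_advisory": ["cert_bulletin"],
    "cert_bulletin": ["vendor_advisory"],
    "analyst_blog": ["vendor_advisory", "cert_bulletin"],
    # The poisoned document cites trusted sources, but nothing cites it.
    "poisoned_doc": ["vendor_advisory", "cert_bulletin"],
}

scores = pagerank(links)
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:16s} {score:.4f}")
```

In this toy graph, outgoing links from the poisoned document do not raise its own score, because authority flows only along inbound links. How the paper constructs edges between CTI sources is not described in this summary, so the edge semantics above should be read as an assumption.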

📝 Abstract

Retrieval-Augmented Generation (RAG) has emerged as the dominant architectural pattern to operationalize Large Language Model (LLM) usage in Cyber Threat Intelligence (CTI) systems. However, this design is susceptible to poisoning attacks, and previously proposed defenses can fail for CTI contexts as cyber threat information is often completely new for emerging attacks, and sophisticated threat actors can mimic legitimate formats, terminology, and stylistic conventions. To address this issue, we propose that the robustness of modern RAG defenses can be accelerated by applying source credibility algorithms on corpora, using PageRank as an example. In our experiments, we demonstrate quantitatively that our algorithm applies a lower authority score to malicious documents while promoting trusted content, using the standardized MS MARCO dataset. We also demonstrate proof-of-concept performance of our algorithm on CTI documents and feeds.

Problem

Research questions and friction points this paper is trying to address.

Addressing poisoning attacks in CTI RAG systems using PageRank

Detecting malicious documents by assigning lower authority scores

Improving robustness of cyber threat intelligence LLM pipelines

Innovation

Methods, ideas, or system contributions that make the work stand out.

Applies PageRank algorithm to CTI RAG pipelines

Ranks source credibility to counter poisoning attacks

Reduces authority scores for malicious threat documents
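One way the authority scores could feed into retrieval is as a reranking step after similarity search. The sketch below is an assumption for illustration only: the summary does not state how the paper combines the two signals, so the linear blend, the weight `alpha`, the function name `rerank`, and all numeric values here are hypothetical.

```python
def rerank(candidates, authority, alpha=0.5, top_k=5):
    """Blend retriever similarity with normalized source authority.

    candidates: list of (doc_id, similarity) pairs, similarity in [0, 1].
    authority:  dict doc_id -> authority score (e.g. from PageRank).
    alpha:      weight given to authority (hypothetical parameter).
    """
    # Normalize by the maximum so authority is on the same scale as similarity.
    max_auth = max(authority.values(), default=0.0) or 1.0
    scored = []
    for doc_id, sim in candidates:
        auth = authority.get(doc_id, 0.0) / max_auth
        scored.append((doc_id, (1 - alpha) * sim + alpha * auth))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Illustrative values: the poisoned document is the closest semantic match.
candidates = [
    ("poisoned_doc", 0.92),
    ("vendor_advisory", 0.88),
    ("cert_bulletin", 0.80),
]
authority = {"vendor_advisory": 0.45, "cert_bulletin": 0.45, "poisoned_doc": 0.04}

ranked = rerank(candidates, authority)
for doc_id, score in ranked:
    print(f"{doc_id:16s} {score:.3f}")
```

With these made-up numbers the poisoned document drops from first to last place even though it has the highest similarity, which is the behavior the highlights above describe. The choice of `alpha` trades off relevance against trust and would need tuning on real data.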

🔎 Similar Papers

A Comprehensive Survey of Contamination Detection Methods in Large Language Models