Connect the Dots: Knowledge Graph-Guided Crawler Attack on Retrieval-Augmented Generation Systems

📅 2026-01-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the vulnerability of retrieval-augmented generation (RAG) systems to privacy leakage under multi-turn querying, a setting in which existing attacks lack long-term planning. The authors propose RAGCrawler, described as the first framework to integrate knowledge graphs into RAG attacks. It builds a global attacker-side state by modeling already-leaked information and plans queries in semantic space to maximize conditional marginal gain. The attack is formally cast as an Adaptive Stochastic Coverage Problem (ASCP). Experiments across diverse RAG architectures and datasets show that, under a fixed query budget, RAGCrawler achieves up to 84.4% corpus coverage and outperforms the strongest baseline by 20.7 percentage points on average, while maintaining high semantic fidelity, strong content reconstruction capability, and robustness against advanced defense mechanisms.
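The conditional marginal gain mentioned in the summary can be written in the usual adaptive-coverage style. This is a sketch in assumed notation, not the paper's own: the symbols $D$, $\psi$, $C(\psi)$, $R(q)$, and $B$ are introduced here for illustration only.

```latex
% Assumed notation (not taken from the paper):
%   D        hidden corpus behind the RAG system
%   \psi     attacker-side state: the queries issued and responses observed so far
%   C(\psi)  subset of D already leaked under state \psi
%   R(q)     random set of documents exposed by issuing query q
%   B        query budget
\Delta(q \mid \psi) = \mathbb{E}\big[\, |C(\psi) \cup R(q)| - |C(\psi)| \;\big|\; \psi \,\big],
\qquad
q_t \in \arg\max_{q} \Delta(q \mid \psi_{t-1}), \quad t = 1, \dots, B .
```

Under this reading, a greedy attacker picks at each turn the query with the largest expected number of not-yet-leaked documents, given everything observed so far.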

📝 Abstract
Stealing attacks pose a persistent threat to the intellectual property of deployed machine-learning systems. Retrieval-augmented generation (RAG) intensifies this risk by extending the attack surface beyond model weights to knowledge bases that often contain IP-bearing assets such as proprietary runbooks, curated domain collections, or licensed documents. Recent work shows that multi-turn questioning can gradually steal corpus content from RAG systems, yet existing attacks are largely heuristic and often plateau early. We address this gap by formulating RAG knowledge-base stealing as an adaptive stochastic coverage problem (ASCP), where each query is a stochastic action and the goal is to maximize the conditional expected marginal gain (CMG) in corpus coverage under a query budget. Bridging ASCP to real-world black-box RAG knowledge-base stealing raises three challenges: CMG is unobservable, the natural-language action space is intractably large, and feasibility constraints require stealthy queries that remain effective under diverse architectures. We introduce RAGCrawler, a knowledge graph-guided attacker that maintains a global attacker-side state to estimate coverage gains, schedule high-value semantic anchors, and generate non-redundant natural queries. Across four corpora and four generators with a BGE retriever, RAGCrawler achieves 66.8% average coverage (up to 84.4%) within 1,000 queries, improving coverage by 44.90% relative to the strongest baseline. It also reduces the number of queries needed to reach 70% coverage by at least 4.03x on average and enables surrogate reconstruction with answer similarity up to 0.699. The attack also extends to retriever switching and to newer RAG techniques such as query rewriting and multi-query retrieval. These results highlight the urgent need to protect RAG knowledge assets.
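To make the coverage framing concrete, here is a minimal toy sketch of a greedy, state-tracking crawler. It is not the RAGCrawler implementation: the toy corpus, the `make_toy_rag` and `adaptive_greedy_crawl` helpers, and the smoothed gain estimate are all assumptions made for illustration. The estimate stands in for the unobservable CMG, and a plain dictionary of anchors stands in for the knowledge graph.

```python
import random


def make_toy_rag(num_topics=5, docs_per_topic=4, top_k=2, seed=0):
    """Hypothetical black-box RAG: each doc has one topic and mentions the next topic."""
    rng = random.Random(seed)
    corpus = {
        f"doc{t}_{d}": {"topic": t, "mentions": (t + 1) % num_topics}
        for t in range(num_topics)
        for d in range(docs_per_topic)
    }

    def query(anchor):
        # Stochastic action: expose up to top_k random docs about the anchor topic,
        # together with the topic each exposed doc mentions.
        pool = sorted(doc_id for doc_id, meta in corpus.items() if meta["topic"] == anchor)
        hits = rng.sample(pool, min(top_k, len(pool)))
        return [(doc_id, corpus[doc_id]["mentions"]) for doc_id in hits]

    return corpus, query


def estimated_gain(n_queries, n_new):
    # Smoothed "new docs per query" estimate (an assumed stand-in for the CMG).
    return (n_new + 1.0) / (n_queries + 1.0)


def adaptive_greedy_crawl(query, start_anchor, budget):
    """Spend the budget greedily on the anchor with the highest estimated gain."""
    leaked = set()
    # Attacker-side state: anchor -> [queries issued, new docs obtained].
    stats = {start_anchor: [0, 0]}
    for _ in range(budget):
        anchor = max(sorted(stats), key=lambda a: estimated_gain(*stats[a]))
        stats[anchor][0] += 1
        for doc_id, mentioned_anchor in query(anchor):
            if doc_id not in leaked:
                leaked.add(doc_id)
                stats[anchor][1] += 1
                # Newly leaked content reveals further anchors to explore.
                stats.setdefault(mentioned_anchor, [0, 0])
    return leaked


corpus, query = make_toy_rag()
leaked = adaptive_greedy_crawl(query, start_anchor=0, budget=30)
print(f"coverage: {len(leaked) / len(corpus):.2f}")
```

The design choice worth noting is that the estimate decays for anchors that keep returning already-seen documents, so the budget shifts toward anchors that were discovered later. This loosely mirrors the idea of scheduling high-value anchors and avoiding redundant queries.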
Problem

Research questions and friction points this paper is trying to address.

Retrieval-Augmented Generation
Privacy Attack
Knowledge Graph
Corpus Extraction
Adversarial Query
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval-Augmented Generation
Knowledge Graph
Adaptive Stochastic Coverage Problem
Privacy Attack
Query Planning
Mengyu Yao
Key Laboratory of High-Confidence Software Technologies (MOE), School of Computer Science, Peking University
Ziqi Zhang
University of Illinois Urbana-Champaign
Software Engineering, AI Security
Ning Luo
University of Illinois Urbana-Champaign
Privacy, Formal methods, Software Verification, Cryptography
Shaofei Li
Peking University
Computer Security
Yifeng Cai
Key Laboratory of High-Confidence Software Technologies (MOE), School of Computer Science, Peking University
Xiangqun Chen
Key Laboratory of High-Confidence Software Technologies (MOE), School of Computer Science, Peking University
Yao Guo
Beijing Institute of Technology
Nanodevices
Ding Li
Peking University
Software Engineering, Security