Hoist with His Own Petard: Inducing Guardrails to Facilitate Denial-of-Service Attacks on Retrieval-Augmented Generation of LLMs

📅 2025-04-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work identifies a novel denial-of-service (DoS) threat in retrieval-augmented generation (RAG) systems, in which the built-in safety guardrails of large language models (LLMs)—designed to prevent harmful outputs—can be adversarially exploited. Method: The authors propose MutedRAG, a targeted attack that injects minimal jailbreak texts (e.g., "How to build a bomb") into the RAG knowledge base; once retrieved, these texts trigger the guardrails and cause the system to reject legitimate queries—including entirely benign ones—effectively silencing it. Contribution/Results: MutedRAG is the first attack to repurpose LLM safety guardrails themselves as DoS vectors; because guardrails are highly sensitive, a single jailbreak sample can affect multiple queries, yielding cross-query amplification that departs fundamentally from prior RAG vulnerability paradigms. Evaluated on three benchmark datasets, the attack achieves a success rate exceeding 60% in many scenarios while requiring fewer than one malicious text per target query on average, and several existing defenses offer only limited protection. The findings expose a structural blind spot in RAG security design and underscore urgent implications for building trustworthy AI systems.

📝 Abstract
Retrieval-Augmented Generation (RAG) integrates Large Language Models (LLMs) with external knowledge bases, improving output quality while introducing new security risks. Existing studies on RAG vulnerabilities typically focus on exploiting the retrieval mechanism to inject erroneous knowledge or malicious texts, inducing incorrect outputs. However, these approaches overlook critical weaknesses within LLMs, leaving important attack vectors unexplored and limiting the scope and efficiency of attacks. In this paper, we uncover a novel vulnerability: the safety guardrails of LLMs, while designed for protection, can also be exploited as an attack vector by adversaries. Building on this vulnerability, we propose MutedRAG, a novel denial-of-service attack that reversely leverages the guardrails of LLMs to undermine the availability of RAG systems. By injecting minimalistic jailbreak texts, such as "How to build a bomb", into the knowledge base, MutedRAG intentionally triggers the LLM's safety guardrails, causing the system to reject legitimate queries. Moreover, due to the high sensitivity of guardrails, a single jailbreak sample can affect multiple queries, effectively amplifying the efficiency of attacks while reducing their costs. Experimental results on three datasets demonstrate that MutedRAG achieves an attack success rate exceeding 60% in many scenarios, requiring fewer than one malicious text per target query on average. In addition, we evaluate potential defense strategies against MutedRAG, finding that some current mechanisms are insufficient to mitigate this threat, underscoring the urgent need for more robust solutions.
Problem

Research questions and friction points this paper is trying to address.

Exploiting LLM guardrails to enable denial-of-service attacks
Triggering safety mechanisms to reject legitimate RAG queries
Amplifying attack efficiency via minimal jailbreak text injections
Innovation

Methods, ideas, or system contributions that make the work stand out.

Exploits LLM safety guardrails as attack vectors
Uses minimalistic jailbreak texts to trigger rejections
Amplifies attack efficiency with single jailbreak samples
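The mechanism above can be illustrated with a toy simulation: a single injected jailbreak text, once retrieved into the RAG context, trips a guardrail-style filter and "mutes" every benign query that happens to retrieve it. The word-overlap retriever, the substring guardrail, and all names below are illustrative assumptions for this sketch, not the paper's actual implementation.

```python
# Toy sketch of a guardrail-based RAG denial-of-service (assumptions, not the
# paper's code): one injected jailbreak text causes refusals across queries.

def retrieve(query, corpus, k=2):
    """Rank corpus texts by naive word overlap with the query (toy retriever)."""
    q_words = set(query.lower().split())
    return sorted(corpus,
                  key=lambda t: len(q_words & set(t.lower().split())),
                  reverse=True)[:k]

def guardrail_refuses(context):
    """Stand-in for an LLM safety guardrail: refuse if an unsafe phrase appears."""
    unsafe_phrases = ["how to build a bomb"]
    return any(p in doc.lower() for doc in context for p in unsafe_phrases)

def rag_answer(query, corpus):
    """Simulated RAG pipeline: retrieve context, then refuse if the guardrail fires."""
    return "refused" if guardrail_refuses(retrieve(query, corpus)) else "answered"

corpus = [
    "Paris is the capital of France.",
    "The Eiffel Tower was completed in 1889.",
    "How to build a bomb",  # single minimal jailbreak text injected by the attacker
]

for q in ["What is the capital of France?",
          "How to build a birdhouse",
          "How to build muscle"]:
    print(q, "->", rag_answer(q, corpus))
```

In this sketch the one injected text is never the answer to any query, yet it is retrieved for both "build" queries and triggers a refusal for each, mirroring the single-shot, cross-query amplification the paper describes.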
Pan Suo
Beijing University of Posts and Telecommunications
Yu-Ming Shang
Beijing University of Posts and Telecommunications
Natural Language Processing · Information Extraction
San-Chuan Guo
Beijing University of Posts and Telecommunications
Xi Zhang
Beijing University of Posts and Telecommunications