🤖 AI Summary
This work identifies and systematically quantifies a novel denial-of-service (DoS) threat against large language model (LLM) safety guardrails, arising from false positives, i.e., erroneous rejections of benign inputs. Attackers can trigger >97% false rejection rates in state-of-the-art guard models (e.g., Llama Guard 3) using either short adversarial prompts (~30 characters) or poisoned fine-tuning. To address this long-overlooked vulnerability, the paper introduces "adversarial robustness to false positives" (ARFP) as a new evaluation dimension for LLM safety, bridging a critical gap in alignment robustness research. Methodologically, it proposes a unified white-box testing framework that integrates prompt optimization, adversarial injection, and poisoned fine-tuning, validated across diverse scenarios. Empirical results demonstrate both the feasibility and severity of such attacks, providing theoretical insights and empirical evidence to strengthen the robustness of safety alignment mechanisms.
📝 Abstract
Safety is a paramount concern for large language models (LLMs) in open deployment, motivating the development of safeguard methods that enforce ethical and responsible use through safety alignment or guardrail mechanisms. Jailbreak attacks that exploit the *false negatives* of safeguard methods have emerged as a prominent research focus in the field of LLM security. However, we find that malicious attackers can also exploit the *false positives* of safeguards, i.e., fool the safeguard model into mistakenly blocking safe content, leading to a denial of service (DoS) for LLM users. To bridge the knowledge gap around this overlooked threat, we explore multiple attack methods, including inserting a short adversarial prompt into user prompt templates and corrupting the LLM on the server via poisoned fine-tuning. In both cases, the attack triggers safeguard rejections of user requests sent from the client. Our evaluation demonstrates the severity of this threat across multiple scenarios. For instance, in the white-box adversarial prompt injection scenario, the attacker can use our optimization process to automatically generate seemingly safe adversarial prompts, only about 30 characters long, that universally block over 97% of user requests on Llama Guard 3. These findings reveal a new dimension in LLM safeguard evaluation: adversarial robustness to false positives.
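To make the template-injection mechanism concrete, the sketch below simulates the attack flow with a toy stand-in for a guard model. Everything here is illustrative: `mock_guard` is a hypothetical placeholder (a real attack targets a guard model such as Llama Guard 3), and `ADV_SUFFIX` is a dummy string standing in for the ~30-character adversarial prompt that the paper's optimization process would produce.

```python
# Toy sketch of the false-positive DoS attack (all names are illustrative,
# not from the paper). A short adversarial string is injected into a shared
# client-side prompt template, causing the guard to reject every request.

ADV_SUFFIX = "[[UNSAFE_TRIGGER]]"  # placeholder for an optimized ~30-char adversarial prompt


def mock_guard(prompt: str) -> str:
    """Stand-in for a guard model: flags any prompt containing the trigger."""
    return "unsafe" if ADV_SUFFIX in prompt else "safe"


def render_template(user_request: str, poisoned: bool) -> str:
    """A prompt template an attacker may have tampered with."""
    suffix = ADV_SUFFIX if poisoned else ""
    return f"User request: {user_request} {suffix}"


benign_requests = [
    "What is the capital of France?",
    "Summarize this article for me.",
    "Convert 10 miles to kilometers.",
]

# Clean template: all benign requests pass the guard.
clean = [mock_guard(render_template(r, poisoned=False)) for r in benign_requests]

# Poisoned template: every benign request is rejected, i.e., a denial of
# service driven purely by the guard's false positives.
dosed = [mock_guard(render_template(r, poisoned=True)) for r in benign_requests]

print(clean)  # ['safe', 'safe', 'safe']
print(dosed)  # ['unsafe', 'unsafe', 'unsafe']
```

The key property the paper measures is universality: a single short string, fixed across all user inputs, drives the false-rejection rate above 97% on the target guard model.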