JailGuard: A Universal Detection Framework for LLM Prompt-based Attacks

📅 2023-12-17
📈 Citations: 15
Influential: 2
📄 PDF
🤖 AI Summary
Large language models (LLMs) and multimodal large language models (MLLMs) are vulnerable to prompt-based attacks, including jailbreaking and prompt hijacking, yet existing detection methods suffer from poor generalizability and reliance on prior knowledge of specific attacks. To address this, we propose the first robustness-difference-based universal detection paradigm, which requires no attack-specific knowledge. Our approach identifies malicious prompts by analyzing response consistency across input mutations. We design 18 cross-modal (text/image) mutators and an adaptive combination policy, significantly enhancing generalization to unseen attacks. Evaluated on 15 known attack types, our method achieves detection accuracies of 86.14% (text) and 82.90% (image), outperforming state-of-the-art methods by 11.81–25.73 (text) and 12.20–21.40 (image) percentage points.
📝 Abstract
The systems and software powered by Large Language Models (LLMs) and Multi-Modal LLMs (MLLMs) play a critical role in numerous scenarios. However, current LLM systems are vulnerable to prompt-based attacks: jailbreaking attacks enable the LLM system to generate harmful content, while hijacking attacks manipulate it into performing attacker-desired tasks, underscoring the need for detection tools. Unfortunately, existing detection approaches are usually tailored to specific attacks, resulting in poor generalization when detecting various attacks across different modalities. To address this, we propose JailGuard, a universal detection framework deployed on top of LLM systems for prompt-based attacks across text and image modalities. JailGuard operates on the principle that attack inputs are inherently less robust than benign ones. Specifically, JailGuard mutates untrusted inputs to generate variants and leverages the discrepancy among the variants' responses on the target model to distinguish attack samples from benign samples. We implement 18 mutators for text and image inputs and design a mutator combination policy to further improve detection generalization. The evaluation on a dataset containing 15 known attack types shows that JailGuard achieves the best detection accuracy of 86.14%/82.90% on text and image inputs, outperforming state-of-the-art methods by 11.81%-25.73% and 12.20%-21.40%.
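The detection loop described in the abstract can be sketched in a few lines. The following is a minimal, hypothetical Python illustration of the robustness-difference idea: mutate an untrusted prompt into several variants, query the target model with each, and flag the input as an attack when the variants' responses diverge. The `query_model` callable, the single `random_deletion` mutator, the Jaccard-based divergence proxy, and the threshold are all assumptions made for illustration; the paper's actual implementation uses 18 mutators and its own discrepancy measure.

```python
import random


def random_deletion(prompt: str, rate: float = 0.1) -> str:
    """Example mutator (hypothetical): drop a fraction of tokens to create a variant."""
    tokens = prompt.split()
    kept = [t for t in tokens if random.random() > rate]
    return " ".join(kept) if kept else prompt


def response_divergence(responses: list[str]) -> float:
    """Crude disagreement proxy: mean pairwise Jaccard distance over token sets.
    A simplified stand-in for the paper's discrepancy measure."""
    sets = [set(r.lower().split()) for r in responses]
    dists = []
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            union = sets[i] | sets[j]
            inter = sets[i] & sets[j]
            dists.append(1 - len(inter) / len(union) if union else 0.0)
    return sum(dists) / len(dists) if dists else 0.0


def is_attack(prompt: str, query_model, n_variants: int = 8,
              threshold: float = 0.5) -> bool:
    """Flag the prompt as an attack if its mutated variants yield unusually
    divergent responses on the target model (query_model is assumed)."""
    variants = [random_deletion(prompt) for _ in range(n_variants)]
    responses = [query_model(v) for v in variants]
    return response_divergence(responses) > threshold
```

Benign prompts tend to produce semantically similar answers under small perturbations, so their divergence stays low, while carefully crafted attack prompts break more easily, which is what the threshold check exploits.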
Problem

Research questions and friction points this paper is trying to address.

Detecting prompt-based attacks (jailbreaking and hijacking) on LLM and MLLM systems
Poor generalization of existing detectors, which rely on prior knowledge of specific attacks and struggle across text and image modalities
Identifying inputs that elicit harmful content or steer the system toward attacker-desired tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Universal, attack-agnostic detection framework for prompt-based attacks on LLMs and MLLMs
Mutates untrusted inputs and flags attacks from the divergence of the variants' responses
Implements 18 text/image mutators with a combination policy to improve generalization (illustrated in the sketch below)
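To make the mutator idea concrete, here is a small hypothetical sketch of two text mutators and a naive combination policy. These are simplified stand-ins: the paper defines 18 cross-modal mutators covering both text and images and an adaptive combination strategy, none of which are reproduced here.

```python
import random


def random_insertion(prompt: str, rate: float = 0.1) -> str:
    """Illustrative mutator: insert placeholder tokens at random positions."""
    tokens = prompt.split()
    out = []
    for tok in tokens:
        out.append(tok)
        if random.random() < rate:
            out.append("<pad>")
    return " ".join(out)


def random_swap(prompt: str, n_swaps: int = 2) -> str:
    """Illustrative mutator: swap random pairs of tokens."""
    tokens = prompt.split()
    for _ in range(n_swaps):
        if len(tokens) < 2:
            break
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return " ".join(tokens)


def combined_mutator(prompt: str, mutators, k: int = 2) -> str:
    """Naive combination policy: apply k randomly chosen mutators in sequence."""
    for mutate in random.sample(mutators, k=min(k, len(mutators))):
        prompt = mutate(prompt)
    return prompt


# Usage: variants produced this way would feed the divergence check
# sketched after the abstract above.
# variants = [combined_mutator(p, [random_insertion, random_swap]) for _ in range(8)]
```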
👥 Authors

Xiaoyu Zhang · Xi’an Jiaotong University, China
Cen Zhang · Research Fellow, Nanyang Technological University · Fuzzing, Testing, Vulnerability
Tianlin Li · Nanyang Technological University · AI4SE, SE4AI, Trustworthy AI
Yihao Huang · Nanyang Technological University, Singapore
Xiaojun Jia · Nanyang Technological University · Explainable AI, Robust AI, Efficient AI
Ming Hu · Nanyang Technological University, Singapore
Jie Zhang · Nanyang Technological University, Singapore
Yang Liu · Nanyang Technological University, Singapore
Shiqing Ma · University of Massachusetts, Amherst · Security, AI, SE
Chao Shen · Xi’an Jiaotong University, China