Investigating Coverage Criteria in Large Language Models: An In-Depth Study Through Jailbreak Attacks

📅 2024-08-27
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Pre-deployment testing of large language models (LLMs) often fails to detect jailbreak attacks. This paper systematically evaluates how sensitive conventional neuron coverage criteria are to jailbreak behavior, analyzing them at three granularities: criterion level, layer level, and token level. Clustering the hidden states reveals clear differences in first-token activation patterns between benign and jailbreak queries. Building on these insights, the authors propose a lightweight, real-time jailbreak detection framework based on first-token hidden-state activation features: it combines Top-k neuron activation with neuron coverage metrics, applies PCA for dimensionality reduction, and classifies queries with logistic regression. Evaluated on standard jailbreak benchmarks, the method reaches an average detection accuracy of 96.33% and can raise an alarm as soon as the first token is generated, which improves security responsiveness during model deployment.
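
The feature-extraction idea in the summary (neuron coverage plus Top-k neuron activation on the first token's hidden state) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the activation threshold, the value of k, and the way the two signals are concatenated into one vector are hypothetical and are not taken from the paper.

```python
from typing import List, Set


def neuron_coverage(activations: List[float], threshold: float = 0.0) -> float:
    """Fraction of neurons whose activation exceeds `threshold` (assumed definition)."""
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


def top_k_neurons(activations: List[float], k: int) -> Set[int]:
    """Indices of the k most strongly activated neurons in one layer."""
    ranked = sorted(range(len(activations)),
                    key=lambda i: activations[i], reverse=True)
    return set(ranked[:k])


def first_token_features(layers: List[List[float]],
                         threshold: float = 0.0,
                         k: int = 2) -> List[float]:
    """Build one feature vector per query (hypothetical layout):
    for each layer, its coverage value followed by a 0/1 flag per neuron
    marking membership in that layer's top-k set."""
    features: List[float] = []
    for acts in layers:
        features.append(neuron_coverage(acts, threshold))
        top = top_k_neurons(acts, k)
        features.extend(1.0 if i in top else 0.0 for i in range(len(acts)))
    return features


# Toy first-token hidden state: 2 layers x 4 neurons (made-up numbers).
toy_layers = [[0.9, -0.2, 0.4, 0.0], [-0.5, 0.7, 0.1, 0.3]]
features = first_token_features(toy_layers, threshold=0.0, k=2)
print(features)
```

In a real setting the per-layer activations would come from the model's hidden states for the first generated token, and the resulting vectors would be much larger, which is presumably why a dimensionality-reduction step precedes classification.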

📝 Abstract
The swift advancement of large language models (LLMs) has profoundly shaped the landscape of artificial intelligence; however, their deployment in sensitive domains raises grave concerns, particularly due to their susceptibility to malicious exploitation. This situation underscores the insufficiencies in pre-deployment testing, highlighting the urgent need for more rigorous and comprehensive evaluation methods. This study presents a comprehensive empirical analysis assessing the efficacy of conventional coverage criteria in identifying these vulnerabilities, with a particular emphasis on the pressing issue of jailbreak attacks. Our investigation begins with a clustering analysis of the hidden states in LLMs, demonstrating that intrinsic characteristics of these states can distinctly differentiate between various types of queries. Subsequently, we assess the performance of these criteria across three critical dimensions: criterion level, layer level, and token level. Our findings uncover significant disparities in neuron activation patterns between the processing of normal and jailbreak queries, thereby corroborating the clustering results. Leveraging these findings, we propose an innovative approach for the real-time detection of jailbreak attacks by utilizing neural activation features. Our classifier demonstrates remarkable accuracy, averaging 96.33% in identifying jailbreak queries, including those that could lead to adversarial attacks. The importance of our research lies in its comprehensive approach to addressing the intricate challenges of LLM security. By enabling instantaneous detection from the model's first token output, our method holds promise for future systems integrating LLMs, offering robust real-time detection capabilities. This study advances our understanding of LLM security testing, and lays a critical foundation for the development of more resilient AI systems.
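
To make the real-time detection idea concrete, the following is a minimal sketch of the final classification stage only: a logistic regression trained on activation features, which flags a query once the first token's features are available. It is an illustrative assumption, not the authors' implementation. The training data is synthetic and invented for this example, the helper names are hypothetical, and any dimensionality-reduction step is omitted for brevity.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logreg(X: Sequence[Sequence[float]], y: Sequence[int],
                 lr: float = 0.5, epochs: int = 500) -> Tuple[List[float], float]:
    """Fit logistic regression with plain batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def jailbreak_probability(x: Sequence[float], w: Sequence[float], b: float) -> float:
    """Estimated probability that a feature vector belongs to a jailbreak query."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Synthetic two-dimensional activation features (made up for illustration):
# label 0 = benign query, label 1 = jailbreak query.
X_train = [[0.20, 0.10], [0.25, 0.15], [0.30, 0.05],
           [0.80, 0.90], [0.75, 0.85], [0.90, 0.70]]
y_train = [0, 0, 0, 1, 1, 1]

w, b = train_logreg(X_train, y_train)

# Decide as soon as the first token's features exist.
p_jailbreak = jailbreak_probability([0.85, 0.80], w, b)
p_benign = jailbreak_probability([0.20, 0.10], w, b)
raise_alarm = p_jailbreak > 0.5
```

Because the decision needs only the features of the first generated token, a check like this could in principle run before the rest of the response is produced, which is the property the abstract highlights.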
Problem

Research questions and friction points this paper is trying to address.

Evaluate coverage criteria effectiveness
Detect jailbreak attacks in LLMs
Enhance LLM security testing efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Clustering analysis of LLM hidden states
Real-time jailbreak detection mechanism
Coverage-guided attack example generation
Authors
Shide Zhou
Huazhong University of Science and Technology, Wuhan, China
Tianlin Li
Nanyang Technological University
AI4SE, SE4AI, Trustworthy AI
Kailong Wang
Huazhong University of Science and Technology, Wuhan, China
Yihao Huang
Nanyang Technological University, Singapore, Singapore
Ling Shi
Nanyang Technological University, Singapore, Singapore
Yang Liu
Nanyang Technological University, Singapore, Singapore
Haoyu Wang
Huazhong University of Science and Technology, Wuhan, China