JBShield: Defending Large Language Models from Jailbreak Attacks through Activated Concept Analysis and Manipulation

📅 2025-02-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Despite safety alignment, large language models (LLMs) remain vulnerable to jailbreak attacks, and existing defenses are limited by an insufficient understanding of the underlying attack mechanisms. This paper proposes a combined detection-and-mitigation defense framework grounded in the Linear Representation Hypothesis: jailbreaking is modeled as the simultaneous activation of "toxic" and "jailbreak" concepts in the model's latent space. The approach has three core components: linear subspace extraction to disentangle the concepts, joint activation detection to identify jailbreak prompts in real time, and targeted hidden-state editing to intervene. By enhancing or suppressing concepts directly in the representation space, the method achieves an interpretable, controllable real-time defense, attaining an average detection accuracy of 0.95 and reducing the success rate of mainstream jailbreak attacks from 61% to 2% across multiple LLMs.
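The detection idea in the summary can be sketched in a few lines. The snippet below is an illustrative stand-in, not the paper's exact procedure: it estimates each concept direction as the normalized mean difference between hidden states of positive and negative prompt sets (a common LRH-style heuristic), then flags a prompt as a jailbreak only when its representation activates both the toxic and the jailbreak direction. The function names and thresholds are hypothetical.

```python
import numpy as np

def concept_direction(pos, neg):
    """Estimate a linear concept direction as the normalized difference of
    mean hidden representations over positive/negative prompt sets
    (an illustrative LRH-style heuristic, not the paper's exact method)."""
    d = pos.mean(axis=0) - neg.mean(axis=0)
    return d / np.linalg.norm(d)

def is_jailbreak(h, v_toxic, v_jail, t_toxic=0.5, t_jail=0.5):
    """Flag a prompt as a jailbreak only if BOTH the toxic and the
    jailbreak concept are activated (projections exceed their thresholds).
    Thresholds here are hypothetical placeholders."""
    return float(h @ v_toxic) > t_toxic and float(h @ v_jail) > t_jail
```

On this view, a plain harmful prompt activates only the toxic direction and is rejected by the model itself, while a jailbreak prompt activates both directions and is caught by the joint test.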

📝 Abstract
Despite the implementation of safety alignment strategies, large language models (LLMs) remain vulnerable to jailbreak attacks, which undermine these safety guardrails and pose significant security threats. Some defenses have been proposed to detect or mitigate jailbreaks, but they are unable to withstand the test of time due to an insufficient understanding of jailbreak mechanisms. In this work, we investigate the mechanisms behind jailbreaks based on the Linear Representation Hypothesis (LRH), which states that neural networks encode high-level concepts as subspaces in their hidden representations. We define the toxic semantics in harmful and jailbreak prompts as toxic concepts and describe the semantics in jailbreak prompts that manipulate LLMs to comply with unsafe requests as jailbreak concepts. Through concept extraction and analysis, we reveal that LLMs can recognize the toxic concepts in both harmful and jailbreak prompts. However, unlike harmful prompts, jailbreak prompts activate the jailbreak concepts and alter the LLM output from rejection to compliance. Building on our analysis, we propose a comprehensive jailbreak defense framework, JBShield, consisting of two key components: jailbreak detection JBShield-D and mitigation JBShield-M. JBShield-D identifies jailbreak prompts by determining whether the input activates both toxic and jailbreak concepts. When a jailbreak prompt is detected, JBShield-M adjusts the hidden representations of the target LLM by enhancing the toxic concept and weakening the jailbreak concept, ensuring LLMs produce safe content. Extensive experiments demonstrate the superior performance of JBShield, achieving an average detection accuracy of 0.95 and reducing the average attack success rate of various jailbreak attacks to 2% from 61% across distinct LLMs.
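The mitigation step described in the abstract, enhancing the toxic concept while weakening the jailbreak concept in the hidden representation, can be sketched as follows. This is a minimal sketch assuming unit-norm concept directions; `alpha` and `beta` are hypothetical tuning knobs, and projecting out the jailbreak component is one illustrative way to "weaken" a concept.

```python
import numpy as np

def mitigate(h, v_toxic, v_jail, alpha=1.0, beta=1.0):
    """Edit a hidden state in the spirit of JBShield-M (sketch):
    strengthen the toxic concept so the model registers the harm,
    then remove (beta * projection of) the jailbreak concept.
    v_toxic and v_jail are assumed unit-norm; alpha/beta are
    hypothetical strengths."""
    h = h + alpha * v_toxic               # enhance the toxic concept
    h = h - beta * (h @ v_jail) * v_jail  # suppress the jailbreak concept
    return h
```

For example, with orthonormal directions, the edited state gains magnitude along the toxic direction and loses its jailbreak component, which is the geometric picture behind steering the output from compliance back to refusal.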
Problem

Research questions and friction points this paper is trying to address.

Despite safety alignment, LLMs remain vulnerable to jailbreak attacks.
Existing defenses rest on an insufficient understanding of jailbreak mechanisms and do not hold up over time.
Detected jailbreaks must also be mitigated so that LLMs still produce safe content.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Activated concept analysis grounded in the Linear Representation Hypothesis
Jailbreak detection (JBShield-D) via joint activation of toxic and jailbreak concepts
Jailbreak mitigation (JBShield-M) via targeted hidden-representation editing
Shenyi Zhang
Wuhan University
AI Security, Adversarial Machine Learning, Large Language Models
Yuchen Zhai
Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University
Keyan Guo
Ph.D. Candidate, Computer Science and Engineering, University at Buffalo, New York, United States
Generative AI, AI Safety & Security, AI for Good
Hongxin Hu
Professor of Computer Science, University at Buffalo, SUNY
Security, Privacy, NFV/SDN/5G, AI, IoT
Shengnan Guo
Beijing Jiaotong University
Spatial-Temporal Data Mining
Zheng Fang
Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University
Lingchen Zhao
Associate Professor, School of Cyber Science and Engineering, Wuhan University
Secure Computation, AI Security
Chao Shen
Xi’an Jiaotong University
Cong Wang
City University of Hong Kong
Qian Wang
Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University