EEG-Defender: Defending against Jailbreak through Early Exit Generation of Large Language Models

📅 2024-08-21
🏛️ arXiv.org
📈 Citations: 3
Influential: 0

🤖 AI Summary
Large language models (LLMs) are vulnerable to jailbreak prompts that elicit harmful outputs, such as instructions for synthesizing controlled substances or material for spreading disinformation, which poses a serious safety risk. Method: We propose EEG-Defender, a lightweight mechanism that detects malicious inputs from the early transformer outputs of an LLM and terminates generation as soon as one is detected. It rests on the empirical observation that, although jailbreak prompts can yield output logits similar to those of benign prompts, their early embeddings in the model's latent space lie closer to those of malicious prompts than to those of benign prompts. Results: Evaluated on ten jailbreak methods across three LLMs, EEG-Defender reduces the Attack Success Rate (ASR) by roughly 85%, compared with about 50% for current state-of-the-art defenses, with minimal impact on the utility and effectiveness of the models.

📝 Abstract
Large Language Models (LLMs) are increasingly attracting attention in various applications. Nonetheless, there is growing concern as some users attempt to exploit these models for malicious purposes, including the synthesis of controlled substances and the propagation of disinformation. To mitigate such risks, "Alignment" techniques have been developed. However, recent studies indicate that this alignment can be undermined using sophisticated prompt engineering or adversarial suffixes, a technique known as "Jailbreak." Our research takes cues from the human-like generation process of LLMs. We identify that while jailbreak prompts may yield output logits similar to those of benign prompts, their initial embeddings within the model's latent space tend to be more similar to those of malicious prompts. Leveraging this finding, we propose using the early transformer outputs of LLMs to detect malicious inputs and terminate generation immediately. Building on this idea, we introduce a simple yet significant defense approach for LLMs called EEG-Defender. We conduct comprehensive experiments on ten jailbreak methods across three models. Our results demonstrate that EEG-Defender reduces the Attack Success Rate (ASR) by a significant margin, roughly 85% compared with 50% for the present SOTAs, with minimal impact on the utility and effectiveness of LLMs.
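
The detect-then-terminate idea described above can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify the classifier, so this example assumes a nearest-centroid rule per early layer and a simple vote across layers. The names `EarlyLayerDetector`, `guarded_generate`, `vote_threshold`, and `REFUSAL` are hypothetical, and the hidden states are hand-made 2-D toy vectors rather than outputs of a real model.

```python
import math

REFUSAL = "I cannot help with that request."  # hypothetical refusal text


def centroid(vectors):
    """Mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class EarlyLayerDetector:
    """Assumed nearest-centroid detector over early-layer hidden states."""

    def __init__(self, benign_by_layer, malicious_by_layer, vote_threshold=0.5):
        # One centroid per class and per early layer, built from labeled prompts.
        self.benign = [centroid(vs) for vs in benign_by_layer]
        self.malicious = [centroid(vs) for vs in malicious_by_layer]
        self.vote_threshold = vote_threshold

    def malicious_votes(self, hidden_by_layer):
        """Count layers where the prompt is closer to the malicious centroid."""
        votes = 0
        for h, b, m in zip(hidden_by_layer, self.benign, self.malicious):
            if euclidean(h, m) < euclidean(h, b):
                votes += 1
        return votes

    def should_exit(self, hidden_by_layer):
        """Exit early when the share of malicious votes reaches the threshold."""
        share = self.malicious_votes(hidden_by_layer) / len(self.benign)
        return share >= self.vote_threshold


def guarded_generate(detector, hidden_by_layer, generate_fn):
    """Return a refusal instead of calling generate_fn if the prompt is flagged."""
    if detector.should_exit(hidden_by_layer):
        return REFUSAL
    return generate_fn()


# Toy data: two early layers, two labeled prompts per class, 2-D hidden states.
benign_by_layer = [[[0.0, 0.0], [0.2, 0.0]], [[1.0, 1.0], [1.0, 1.2]]]
malicious_by_layer = [[[1.0, 1.0], [1.2, 1.0]], [[-1.0, -1.0], [-1.0, -1.2]]]
detector = EarlyLayerDetector(benign_by_layer, malicious_by_layer)

jailbreak_like = [[0.9, 0.9], [-0.8, -0.9]]  # near the malicious centroids
benign_like = [[0.1, 0.1], [0.9, 1.0]]       # near the benign centroids

print(guarded_generate(detector, jailbreak_like, lambda: "normal answer"))
print(guarded_generate(detector, benign_like, lambda: "normal answer"))
```

In a real setting, the per-layer vectors would come from the model's own early hidden states for the prompt, and the threshold would trade off missed attacks against refusals of benign prompts.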
Problem

Research questions and friction points this paper is trying to address.

Detecting malicious inputs in Large Language Models
Reducing jailbreak attack success rates significantly
Using early transformer outputs for defense
Innovation

Methods, ideas, or system contributions that make the work stand out.

Early exit generation for defense
Early transformer outputs used to detect malicious inputs
Roughly 85% reduction in attack success rate, versus 50% for prior SOTA defenses