🤖 AI Summary
Real-time defense against jailbreaking attacks on large language models (LLMs) remains challenging due to the computational overhead and deployment complexity of existing methods, which typically require multiple queries or auxiliary models.
Method: This paper proposes a single-forward-pass detection method that identifies malicious inputs solely from the distributional characteristics of output-layer logits—without fine-tuning the target LLM or invoking external models. It integrates lightweight, logits-based classification with semantic harmfulness prediction for end-to-end jailbreak detection. Crucially, the approach operates robustly even under partial logits observability (e.g., via restricted APIs in GPT-3.5/GPT-4).
Contribution/Results: The method achieves high detection accuracy and low false-positive rates on open-source LLMs and demonstrates strong generalization to commercial closed-source models. By eliminating iterative querying and auxiliary model dependencies, it significantly reduces computational cost and deployment complexity while maintaining efficiency and broad applicability.
📝 Abstract
Defending aligned Large Language Models (LLMs) against jailbreaking attacks is a challenging problem, with existing approaches requiring multiple requests or even queries to auxiliary LLMs, making them computationally heavy. Instead, we focus on detecting jailbreaking inputs in a single forward pass. Our method, called Single Pass Detection (SPD), leverages the information carried by the logits to predict whether the output sentence will be harmful, allowing us to defend in just one forward pass. SPD not only detects attacks effectively on open-source models, but also minimizes the misclassification of harmless inputs. Furthermore, we show that SPD remains effective even without complete logit access in GPT-3.5 and GPT-4. We believe that our proposed method offers a promising approach to efficiently safeguard LLMs against adversarial attacks.
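To make the idea concrete, here is a minimal sketch of the general recipe the summary describes: reduce the output-layer logits from a single forward pass to a small distributional feature vector, then train a lightweight classifier to flag malicious inputs. The paper's actual features and classifier are not specified here; the logit data, vocabulary size, top-k feature choice, and logistic-regression head below are all illustrative assumptions.

```python
# Hedged sketch of logits-based jailbreak detection in the spirit of SPD.
# Assumptions (not from the paper): a toy vocabulary, synthetic logit
# vectors, top-k sorted logits as features, and a logistic-regression head.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
VOCAB = 1000   # toy vocabulary size (assumption)
TOP_K = 10     # number of top logits kept as features (assumption)

def logit_features(logits: np.ndarray) -> np.ndarray:
    """Summarize each vocabulary-sized logit vector by its top-k values,
    a simple stand-in for 'distributional characteristics' of the logits."""
    return np.sort(logits, axis=1)[:, ::-1][:, :TOP_K]

# Synthetic stand-ins: "harmful" prompts are simulated as shifting a few
# logit coordinates upward, so the top of the distribution looks different.
benign = rng.normal(0.0, 1.0, size=(200, VOCAB))
harmful = rng.normal(0.0, 1.0, size=(200, VOCAB))
harmful += 5.0 * (rng.random((200, VOCAB)) < 0.01)  # boost ~1% of entries

X = np.vstack([logit_features(benign), logit_features(harmful)])
y = np.array([0] * 200 + [1] * 200)  # 0 = benign, 1 = harmful

# One forward pass yields the logits; classification is then a cheap
# linear model, with no extra LLM queries or auxiliary models.
clf = LogisticRegression(max_iter=1000).fit(X, y)
train_acc = clf.score(X, y)
```

Note that restricting features to the top-k logits also mirrors the partial-observability setting mentioned above: APIs like GPT-3.5/GPT-4 expose only the top few log-probabilities, yet that truncated view can still carry enough signal for this kind of classifier.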