🤖 AI Summary
Real-time defense against jailbreaking attacks on large language models (LLMs) remains challenging due to the computational overhead and deployment complexity of existing methods, which typically require multiple queries or auxiliary models.
Method: This paper proposes a single-forward-pass detection method that identifies malicious inputs solely from the distributional characteristics of output-layer logits—without fine-tuning the target LLM or invoking external models. It integrates lightweight, logits-based classification with semantic harmfulness prediction for end-to-end jailbreak detection. Crucially, the approach operates robustly even under partial logits observability (e.g., via restricted APIs in GPT-3.5/GPT-4).
Contribution/Results: The method achieves high detection accuracy and low false-positive rates on open-source LLMs and demonstrates strong generalization to commercial closed-source models. By eliminating iterative querying and auxiliary model dependencies, it significantly reduces computational cost and deployment complexity while maintaining efficiency and broad applicability.
📝 Abstract
Defending aligned Large Language Models (LLMs) against jailbreaking attacks is a challenging problem, with existing approaches requiring multiple requests or even queries to auxiliary LLMs, making them computationally heavy. Instead, we focus on detecting jailbreaking inputs in a single forward pass. Our method, called Single Pass Detection (SPD), leverages the information carried by the logits to predict whether the output sentence will be harmful, allowing us to defend in just one forward pass. SPD not only detects attacks effectively on open-source models, but also minimizes the misclassification of harmless inputs. Furthermore, we show that SPD remains effective even without complete logit access in GPT-3.5 and GPT-4. We believe that our proposed method offers a promising approach to efficiently safeguard LLMs against adversarial attacks.
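To make the idea concrete, here is a minimal sketch of the general recipe the summary describes: reduce the output-layer logits from a single forward pass to a small distributional feature vector, then train a lightweight classifier to flag malicious inputs. The paper's actual features and classifier are not specified here; the logit data, vocabulary size, top-k feature choice, and logistic-regression head below are all illustrative assumptions.

```python
# Hedged sketch of logits-based jailbreak detection in the spirit of SPD.
# Assumptions (not from the paper): a toy vocabulary, synthetic logit
# vectors, top-k sorted logits as features, and a logistic-regression head.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
VOCAB = 1000   # toy vocabulary size (assumption)
TOP_K = 10     # number of top logits kept as features (assumption)

def logit_features(logits: np.ndarray) -> np.ndarray:
    """Summarize each vocabulary-sized logit vector by its top-k values,
    a simple stand-in for 'distributional characteristics' of the logits."""
    return np.sort(logits, axis=1)[:, ::-1][:, :TOP_K]

# Synthetic stand-ins: "harmful" prompts are simulated as shifting a few
# logit coordinates upward, so the top of the distribution looks different.
benign = rng.normal(0.0, 1.0, size=(200, VOCAB))
harmful = rng.normal(0.0, 1.0, size=(200, VOCAB))
harmful += 5.0 * (rng.random((200, VOCAB)) < 0.01)  # boost ~1% of entries

X = np.vstack([logit_features(benign), logit_features(harmful)])
y = np.array([0] * 200 + [1] * 200)  # 0 = benign, 1 = harmful

# One forward pass yields the logits; classification is then a cheap
# linear model, with no extra LLM queries or auxiliary models.
clf = LogisticRegression(max_iter=1000).fit(X, y)
train_acc = clf.score(X, y)
```

Note that restricting features to the top-k logits also mirrors the partial-observability setting mentioned above: APIs like GPT-3.5/GPT-4 expose only the top few log-probabilities, yet that truncated view can still carry enough signal for this kind of classifier.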