Lockpicking LLMs: A Logit-Based Jailbreak Using Token-level Manipulation

📅 2024-05-20
🏛️ arXiv.org
📈 Citations: 18
Influential: 1
🤖 AI Summary
Existing token-level jailbreaking attacks against large language models (LLMs) suffer from poor generalizability and inefficiency under frequent model updates and strengthened defenses. To address this, we propose JailMine: a logit-guided, token-level proactive mining framework. Our method introduces targeted perturbations in the logit space and token-level probability reweighting to automatically discover adversarial token sequences that elicit affirmative responses, iteratively suppressing refusal behaviors. JailMine is the first approach to simultaneously achieve high attack success rates (95% average), strong adaptability across diverse LLMs, and significant efficiency gains (86% reduction in runtime). Extensive evaluations across multiple LLMs and cross-dataset benchmarks demonstrate substantial superiority over prior token-level attack methods. By enabling scalable and transferable adversarial testing, JailMine establishes a novel paradigm for robustness evaluation of LLMs.
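The paper's actual algorithm is not reproduced here, but the core idea of logit-space manipulation can be illustrated with a toy sketch: penalize tokens associated with refusals and boost tokens associated with affirmative responses before the softmax. The token strings and the penalty/boost values below are hypothetical, chosen only for illustration.

```python
import math

def softmax(logits):
    """Convert a {token: logit} dict into a {token: probability} dict."""
    m = max(logits.values())
    exp = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exp.values())
    return {t: e / z for t, e in exp.items()}

def reweight(logits, refusal_tokens, affirmative_tokens, penalty=5.0, boost=2.0):
    """Suppress refusal tokens and boost affirmative tokens in logit space."""
    adjusted = dict(logits)
    for t in refusal_tokens:
        if t in adjusted:
            adjusted[t] -= penalty
    for t in affirmative_tokens:
        if t in adjusted:
            adjusted[t] += boost
    return adjusted

# Toy next-token distribution (hypothetical tokens and logits).
logits = {"Sure": 1.0, "I": 2.5, "cannot": 2.0, "Here": 0.5}
adjusted = reweight(logits,
                    refusal_tokens={"I", "cannot"},
                    affirmative_tokens={"Sure", "Here"})
probs = softmax(adjusted)
```

After reweighting, affirmative tokens dominate the distribution even though a refusal token ("I") originally had the highest logit; this is the general mechanism by which logit-guided attacks steer generation toward affirmative continuations.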

📝 Abstract
Large language models (LLMs) have transformed the field of natural language processing, but they remain susceptible to jailbreaking attacks that exploit their capabilities to generate unintended and potentially harmful content. Existing token-level jailbreaking techniques, while effective, face scalability and efficiency challenges, especially as models undergo frequent updates and incorporate advanced defensive measures. In this paper, we introduce JailMine, an innovative token-level manipulation approach that addresses these limitations effectively. JailMine employs an automated "mining" process to elicit malicious responses from LLMs by strategically selecting affirmative outputs and iteratively reducing the likelihood of rejection. Through rigorous testing across multiple well-known LLMs and datasets, we demonstrate JailMine's effectiveness and efficiency, achieving a significant average reduction of 86% in time consumed while maintaining high success rates averaging 95%, even in the face of evolving defensive strategies. Our work contributes to the ongoing effort to assess and mitigate the vulnerability of LLMs to jailbreaking attacks, underscoring the importance of continued vigilance and proactive measures to enhance the security and reliability of these powerful language models.
Problem

Research questions and friction points this paper is trying to address.

Addresses scalability and efficiency challenges in token-level jailbreaking attacks.
Introduces an automated mining process to elicit malicious responses from LLMs.
Assesses and mitigates LLM vulnerabilities to jailbreaking for enhanced security.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated mining process for malicious response elicitation
Strategic selection of affirmative outputs to reduce rejection
Iterative likelihood reduction for efficient jailbreaking success
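The iterative aspect of the contribution (greedily extend a candidate prefix with whichever token most reduces the chance of rejection) can be sketched in miniature. This is not the paper's implementation: the vocabulary, the `refusal_score` stand-in for a real model query, and the stopping threshold are all assumptions made for the example.

```python
# Hypothetical toy vocabulary for the search.
VOCAB = ["Sure", "Here", "I", "cannot", "Okay"]

def refusal_score(seq):
    """Toy stand-in for a model's refusal probability given a forced prefix.

    In a real attack this would come from querying the target LLM;
    here, 'affirmative' tokens halve the score and others leave it unchanged.
    """
    score = 1.0
    for tok in seq:
        if tok in {"Sure", "Here", "Okay"}:
            score *= 0.5
    return score

def mine_prefix(max_len=4, threshold=0.1):
    """Greedily build a prefix that drives the refusal score below a threshold."""
    prefix = []
    for _ in range(max_len):
        # Pick the candidate token that most lowers the refusal score.
        best = min(VOCAB, key=lambda t: refusal_score(prefix + [t]))
        prefix.append(best)
        if refusal_score(prefix) < threshold:
            break
    return prefix
```

The greedy loop mirrors the "iteratively reducing the likelihood of rejection" step at a conceptual level; the real method operates on model logits rather than a hand-written score.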