LLMs can be Dangerous Reasoners: Analyzing-based Jailbreak Attack on Large Language Models

📅 2024-07-23

📈 Citations: 2

✨ Influential: 0

career value

217K/year

🤖 AI Summary

This work uncovers an intrinsic safety vulnerability in large language models (LLMs): their capacity to autonomously generate harmful content during complex reasoning. Addressing the limitations of existing jailbreaking attacks—which rely heavily on manual prompt engineering or iterative optimization—we propose a novel “self-analytic jailbreaking” paradigm. It leverages the model’s own chain-of-thought (CoT) reasoning capability to autonomously produce harmful outputs via instruction rewriting and multi-step semantic decoupling, without external intervention. To enhance generalizability, we further introduce a cross-model transfer strategy. Our method achieves an 82.1% attack success rate on mainstream models including GPT-4o, significantly outperforming baseline approaches. It demonstrates high efficiency, strong generalization across diverse tasks and prompts, and cross-architectural transferability. This work advances LLM safety evaluation by providing both a new conceptual framework and a practical, scalable tool for probing latent adversarial behaviors in reasoning-intensive scenarios.

Technology Category

Application Category

📝 Abstract

The rapid development of Large Language Models (LLMs) has brought significant advancements across various tasks. However, despite these achievements, LLMs still exhibit inherent safety vulnerabilities, especially when confronted with jailbreak attacks. Existing jailbreak methods suffer from two main limitations: reliance on complicated prompt engineering and iterative optimization, which lead to low attack success rate (ASR) and attack efficiency (AE). In this work, we propose an efficient jailbreak attack method, Analyzing-based Jailbreak (ABJ), which leverages the advanced reasoning capability of LLMs to autonomously generate harmful content, revealing their underlying safety vulnerabilities during complex reasoning process. We conduct comprehensive experiments on ABJ across various open-source and closed-source LLMs. In particular, ABJ achieves high ASR (82.1% on GPT-4o-2024-11-20) with exceptional AE among all target LLMs, showcasing its remarkable attack effectiveness, transferability, and efficiency. Our findings underscore the urgent need to prioritize and improve the safety of LLMs to mitigate the risks of misuse.

Problem

Research questions and friction points this paper is trying to address.

Analyzing-based Jailbreak Attack

Safety vulnerabilities in LLMs

Efficiency and success rate improvement

Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzing-based Jailbreak method

Autonomous harmful content generation

High attack success rate

🔎 Similar Papers

Large Language Models Are Involuntary Truth-Tellers: Exploiting Fallacy Failure for Jailbreak Attacks