AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs

📅 2024-10-03
🏛️ arXiv.org
📈 Citations: 15
Influential: 2
🤖 AI Summary
Problem: Existing jailbreaking methods for red-teaming large language models (LLMs) rely heavily on human priors and cover only a limited range of attack strategies. Method: This paper proposes the first lifelong autonomous agent framework for black-box jailbreaking, one that requires no predefined strategy space or human intervention. It integrates reinforcement learning–driven strategy generation, dynamic strategy pool evolution, and multi-objective adaptive evaluation to enable continuous self-exploration and optimization of jailbreaking strategies. Contribution/Results: It introduces a lifelong learning mechanism for strategy self-exploration and supports plug-and-play integration of human-designed strategies, removing the traditional dependency on manual priors. Evaluated on public benchmarks, the framework achieves a 74.3% higher average attack success rate than baselines. On GPT-4-1106-turbo it reaches an 88.5% attack success rate, rising to 93.4% when augmented with human-crafted strategies.

📝 Abstract
In this paper, we propose AutoDAN-Turbo, a black-box jailbreak method that can automatically discover as many jailbreak strategies as possible from scratch, without any human intervention or predefined scopes (e.g., specified candidate strategies), and use them for red-teaming. As a result, AutoDAN-Turbo significantly outperforms baseline methods, achieving a 74.3% higher average attack success rate on public benchmarks. Notably, AutoDAN-Turbo achieves an 88.5% attack success rate on GPT-4-1106-turbo. In addition, AutoDAN-Turbo is a unified framework that can incorporate existing human-designed jailbreak strategies in a plug-and-play manner. With human-designed strategies integrated, AutoDAN-Turbo achieves an even higher attack success rate of 93.4% on GPT-4-1106-turbo.

Problem

Research questions and friction points this paper is trying to address.

Automatically discover jailbreak strategies for LLMs
Improve attack success rate without human intervention
Integrate human-designed strategies for higher effectiveness

Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatically discovers jailbreak strategies without human intervention
Unified framework for plug-and-play human-designed strategies
Achieves high attack success rates across LLMs, e.g., 88.5% on GPT-4-1106-turbo (93.4% with human-designed strategies)