🤖 AI Summary
Existing jailbreaking methods for red-teaming large language models (LLMs) rely heavily on human priors and cover only a limited range of attack strategies. Method: This paper proposes AutoDAN-Turbo, the first lifelong autonomous agent framework for black-box jailbreaking that requires no predefined strategy space or human intervention. It combines automated strategy generation, dynamic strategy pool evolution, and multi-objective adaptive evaluation to enable continuous self-exploration and optimization of jailbreaking strategies. Contribution/Results: It introduces a novel lifelong learning mechanism for strategy self-exploration and supports plug-and-play integration of human-designed strategies, removing the traditional dependency on manual priors. Evaluated on public benchmarks, the framework achieves a 74.3% higher average attack success rate than baselines. It reaches an attack success rate of 88.5% on GPT-4-1106-turbo, rising to 93.4% when augmented with human-designed strategies.
📝 Abstract
In this paper, we propose AutoDAN-Turbo, a black-box jailbreak method that automatically discovers as many jailbreak strategies as possible from scratch, without any human intervention or predefined scope (e.g., specified candidate strategies), and uses them for red-teaming. As a result, AutoDAN-Turbo significantly outperforms baseline methods, achieving a 74.3% higher average attack success rate on public benchmarks. Notably, AutoDAN-Turbo achieves an attack success rate of 88.5% on GPT-4-1106-turbo. In addition, AutoDAN-Turbo is a unified framework that can incorporate existing human-designed jailbreak strategies in a plug-and-play manner. By integrating human-designed strategies, AutoDAN-Turbo achieves an even higher attack success rate of 93.4% on GPT-4-1106-turbo.