🤖 AI Summary
Existing multi-turn jailbreak benchmarks are small in scale and rely heavily on templates, which makes it difficult to evaluate the safety of large language models comprehensively in realistic dialogues. This work proposes MultiBreak, which combines unified intent modeling with an uncertainty-guided active learning mechanism and an iteratively fine-tuned generator to produce diverse, naturally fluent multi-turn adversarial prompts. The resulting benchmark comprises 10,389 multi-turn adversarial prompts covering 2,665 distinct harmful intents. Experiments show that MultiBreak achieves attack success rates up to 54.0% higher than the second-best dataset on DeepSeek-R1-7B and up to 34.6% higher on GPT-4.1-mini, exposing fine-grained safety vulnerabilities in complex interactive scenarios.
📝 Abstract
We present MultiBreak, a scalable and diverse multi-turn jailbreak benchmark for evaluating large language model (LLM) safety. Multi-turn jailbreaks mimic natural conversational settings, which makes it easier for them to bypass safety-aligned LLMs than for single-turn jailbreaks. Existing multi-turn benchmarks are limited in size or rely heavily on templates, which restricts their diversity. To address this gap, we unify a wide range of harmful jailbreak intents and introduce an active learning pipeline for expanding high-quality multi-turn adversarial prompts, in which a generator is iteratively fine-tuned to produce stronger attack candidates, guided by uncertainty-based refinement. MultiBreak includes 10,389 multi-turn adversarial prompts, spans 2,665 distinct harmful intents, and covers the most diverse set of topics to date. Empirical evaluation shows that our benchmark achieves up to 54.0% and 34.6% higher attack success rate (ASR) than the second-best dataset on DeepSeek-R1-7B and GPT-4.1-mini, respectively. More importantly, safety evaluations suggest that diverse attack categories uncover fine-grained LLM vulnerabilities, and that categories that appear benign in single-turn settings can exhibit substantially higher adversarial effectiveness in multi-turn scenarios. These findings highlight persistent vulnerabilities of LLMs under realistic adversarial settings and establish MultiBreak as a scalable resource for advancing LLM safety.