MT-JailBench: A Modular Benchmark for Understanding Multi-Turn Jailbreak Attacks

📅 2026-05-09
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the lack of a unified benchmark for multi-turn jailbreak attacks, which makes it hard to tell whether reported gains stem from the attack mechanisms themselves or from differences in experimental conditions. The authors propose MT-JailBench, a modular evaluation framework that decomposes each attack into five components: evaluation function, attack strategy, prompt generation, prompt refinement, and flow control. This decomposition enables comparison under fixed conditions and component-wise ablation. Experiments with the framework identify resource budgets and evaluation functions as major confounders and show that stochastic sampling from a fixed strategy can rival more elaborate dynamic strategy generation. Recomposing the best components yields a strong attack configuration that outperforms its source attacks and generalizes across diverse target LLMs.
πŸ“ Abstract
Multi-turn jailbreaks exploit the ability of large language models to accumulate and act on conversational context. Instead of stating a harmful request directly, an attacker can gradually steer the conversation toward an unsafe answer. Recent methods demonstrate this risk, but they are usually evaluated as black-box pipelines with different budgets, judges, retry rules, and strategy generation procedures. As a result, it is often unclear whether reported gains reflect stronger attack mechanisms or different experimental conditions. We introduce MT-JailBench, a modular evaluation framework for benchmarking multi-turn jailbreaks under fixed conditions. MT-JailBench implements each attack as five interacting modules: evaluation function, attack strategy, prompt generation, prompt refinement, and flow control. This design enables fair comparison across attack methods and component-wise analysis of what drives attack success. Using MT-JailBench, we find that resource budgets and evaluation functions are major confounders: controlling turns, retries, interactions, sampled strategies, and judges substantially changes the ranking of attacks. At the component level, prompt generation accounts for most performance variation, while refinement and flow control provide moderate gains. We also find that explicit dynamic strategy generation is not always necessary; stochastic sampling from a fixed strategy can rival more elaborate diversification mechanisms. Finally, recomposing the best components yields a strong attack configuration that outperforms its source attacks and generalizes across diverse target LLMs. MT-JailBench therefore provides a modular framework for comparing multi-turn jailbreaks, understanding the impact of components, and guiding stronger red-teaming evaluations.
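To make the five-module decomposition and the budget confounder more concrete, here is a minimal Python sketch of an evaluation harness with a fixed turn and retry budget. It is not the authors' implementation: the names `Budget`, `Components`, `run_episode`, and every toy stub are hypothetical, the target is a placeholder that merely upper-cases its input, and no real model, judge, or prompt is involved.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]


@dataclass(frozen=True)
class Budget:
    """Hypothetical resource budget held fixed across compared methods."""
    max_turns: int
    max_retries: int


@dataclass
class Components:
    """Hypothetical container for the five modules named in the abstract."""
    judge: Callable[[str], float]             # evaluation function
    strategy: Callable[[random.Random], str]  # attack strategy (sampled plan label)
    generate: Callable[[str, int], str]       # prompt generation
    refine: Callable[[str, int], str]         # prompt refinement
    should_stop: Callable[[float], bool]      # flow control


def run_episode(target: Callable[[History, str], str],
                parts: Components,
                budget: Budget,
                seed: int = 0) -> Tuple[float, int]:
    """Run one episode; return (best judge score, number of target queries)."""
    rng = random.Random(seed)
    plan = parts.strategy(rng)
    history: History = []
    best, queries = 0.0, 0
    for turn in range(budget.max_turns):
        prompt = parts.generate(plan, turn)
        reply = ""
        for retry in range(budget.max_retries + 1):
            reply = target(history, prompt)
            queries += 1
            score = parts.judge(reply)
            best = max(best, score)
            # Accept the reply once it scores, or once retries are exhausted.
            if score > 0.0 or retry == budget.max_retries:
                break
            prompt = parts.refine(prompt, retry)
        history.append((prompt, reply))
        if parts.should_stop(best):
            break
    return best, queries


# Toy stand-ins only: the "target" upper-cases the prompt, and the "judge"
# fires when a refined prompt (tagged "r1") was used.
def toy_target(history: History, prompt: str) -> str:
    return prompt.upper()


toy = Components(
    judge=lambda reply: 1.0 if "R1" in reply else 0.0,
    strategy=lambda rng: rng.choice(["plan-a", "plan-b"]),
    generate=lambda plan, turn: f"{plan}/t{turn}",
    refine=lambda prompt, retry: f"{prompt}/r{retry + 1}",
    should_stop=lambda best: best >= 1.0,
)

with_retries = run_episode(toy_target, toy, Budget(max_turns=3, max_retries=2))
no_retries = run_episode(toy_target, toy, Budget(max_turns=3, max_retries=0))
```

In this toy setup the same components succeed when retries are allowed and fail when they are not, which illustrates why a comparison that does not fix the budget can misattribute a gain to the method itself.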
Problem

Research questions and friction points this paper is trying to address.

multi-turn jailbreak
benchmarking
large language models
evaluation framework
attack comparison
Innovation

Methods, ideas, or system contributions that make the work stand out.

modular benchmark
multi-turn jailbreak
component-wise analysis
prompt generation
red-teaming evaluation