Many-Turn Jailbreaking

📅 2025-08-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM jailbreaking research focuses predominantly on single-turn attacks, overlooking the practical risk of sustained generation of unsafe content across multi-turn dialogues. Method: This work introduces the concept of "multi-turn jailbreaking," showing how an initial jailbreak can trigger cascading safety failures in subsequent responses, even to unrelated queries. The authors construct MTJ-Bench, the first dedicated benchmark for evaluating multi-turn jailbreaking, covering diverse continuous-interaction scenarios, and design a multi-turn testing framework spanning both open- and closed-source LLMs, along with an interactive attack evaluation protocol. Contribution/Results: Extensive experiments across mainstream LLMs systematically confirm the prevalence of multi-turn jailbreaking. The work establishes long-horizon dialogue safety as a new research frontier and provides a reproducible, extensible benchmark and methodology, enabling the community to shift from isolated, single-turn defenses toward context-aware, continuous safety governance.
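The multi-turn testing procedure described above can be sketched as a simple evaluation loop: issue an initial jailbreak prompt, continue the dialogue with follow-up queries (related or unrelated), and check each response for unsafe content. This is a minimal illustrative sketch, not MTJ-Bench's actual code; `chat_model` and `is_unsafe` are hypothetical stand-ins for a conversational LLM interface and a safety classifier.

```python
def evaluate_multi_turn(chat_model, jailbreak_prompt, follow_ups, is_unsafe):
    """Run one multi-turn episode: an initial jailbreak attempt followed by
    follow-up queries, recording which turns produced unsafe responses.

    chat_model  -- callable taking the full message history, returning a reply
    is_unsafe   -- callable flagging whether a reply contains unsafe content
    """
    history = []       # full dialogue, so later turns see earlier jailbroken context
    unsafe_turns = []
    for turn, prompt in enumerate([jailbreak_prompt, *follow_ups]):
        history.append({"role": "user", "content": prompt})
        reply = chat_model(history)  # model conditions on the whole conversation
        history.append({"role": "assistant", "content": reply})
        if is_unsafe(reply):
            unsafe_turns.append(turn)
    return unsafe_turns
```

A cascading failure of the kind the paper describes would show up here as unsafe flags persisting on turns after the initial jailbreak, including follow-ups unrelated to the original target query.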

📝 Abstract
Current jailbreaking work on large language models (LLMs) aims to elicit unsafe outputs from given prompts. However, it focuses only on single-turn jailbreaking targeting one specific query. In contrast, advanced LLMs are designed to handle extremely long contexts and can thus conduct multi-turn conversations. We therefore propose exploring multi-turn jailbreaking, in which jailbroken LLMs are continuously tested beyond the first-turn conversation or a single target query. This is an even more serious threat because 1) users commonly ask relevant follow-up questions to clarify certain jailbroken details, and 2) the initial round of jailbreaking may cause the LLM to respond consistently to additional, irrelevant questions. As a first step (first draft completed in June 2024) in exploring multi-turn jailbreaking, we construct the Multi-Turn Jailbreak Benchmark (MTJ-Bench) for benchmarking this setting on a series of open- and closed-source models and provide novel insights into this new safety threat. By revealing this new vulnerability, we aim to call for community efforts to build safer LLMs and to pave the way for a deeper understanding of jailbreaking LLMs.
Problem

Research questions and friction points this paper is trying to address.

Exploring multi-turn jailbreaking in large language models
Benchmarking safety threats in extended conversational contexts
Addressing continuous unsafe outputs beyond single-turn queries
Innovation

Methods, ideas, or system contributions that make the work stand out.

Exploring multi-turn jailbreaking in LLMs
Constructing Multi-Turn Jailbreak Benchmark (MTJ-Bench)
Revealing new vulnerability for safer LLMs