X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents

📅 2025-04-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

In multi-turn interactions, large language models (LLMs) are vulnerable to attacks that spread harmful intent across several turns, a risk that single-turn safety frameworks do not model. X-Teaming is an adaptive multi-agent red-teaming framework that combines collaborative planning, attack optimization, and verification to discover jailbreak paths that unfold across turns. The authors also introduce XGuard-Train, an open-source multi-turn safety training dataset of 30K interactive jailbreaks, 20× larger than the previous best resource. Evaluated on leading open-weight and closed-source models, X-Teaming reaches multi-turn attack success rates of up to 98.1%, including 96.2% against Claude 3.7 Sonnet; XGuard-Train is intended to support more robust multi-turn safety alignment.

📝 Abstract

Multi-turn interactions with language models (LMs) pose critical safety risks, as harmful intent can be strategically spread across exchanges. Yet, the vast majority of prior work has focused on single-turn safety, while adaptability and diversity remain among the key challenges of multi-turn red-teaming. To address these challenges, we present X-Teaming, a scalable framework that systematically explores how seemingly harmless interactions escalate into harmful outcomes and generates corresponding attack scenarios. X-Teaming employs collaborative agents for planning, attack optimization, and verification, achieving state-of-the-art multi-turn jailbreak effectiveness and diversity with success rates up to 98.1% across representative leading open-weight and closed-source models. In particular, X-Teaming achieves a 96.2% attack success rate against the latest Claude 3.7 Sonnet model, which has been considered nearly immune to single-turn attacks. Building on X-Teaming, we introduce XGuard-Train, an open-source multi-turn safety training dataset that is 20x larger than the previous best resource, comprising 30K interactive jailbreaks, designed to enable robust multi-turn safety alignment for LMs. Our work offers essential tools and insights for mitigating sophisticated conversational attacks, advancing the multi-turn safety of LMs.
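
The abstract reports attack success rates per target model but does not spell out how the figure is computed. As a minimal sketch, assuming the rate is the share of test behaviors for which at least one attempt was judged successful, it could be tallied as follows. The record layout, the function name, and the toy verdicts are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def attack_success_rate(records):
    """Return {model: percentage of behaviors with at least one successful attempt}.

    `records` is an iterable of (model, behavior_id, success) tuples.
    This layout is an assumption made for illustration.
    """
    outcomes = defaultdict(dict)
    for model, behavior_id, success in records:
        # A behavior counts as broken if any attempt against it succeeded.
        previous = outcomes[model].get(behavior_id, False)
        outcomes[model][behavior_id] = previous or success
    return {
        model: 100.0 * sum(by_behavior.values()) / len(by_behavior)
        for model, by_behavior in outcomes.items()
    }


# Toy verdicts invented for illustration only (not results from the paper).
records = [
    ("model-a", "b1", False),
    ("model-a", "b1", True),
    ("model-a", "b2", False),
    ("model-b", "b1", True),
    ("model-b", "b2", True),
]
rates = attack_success_rate(records)
print(rates)
```

Counting per behavior rather than per attempt keeps retries from inflating the denominator. Whether the paper counts this way is not stated in the abstract.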

Problem

Research questions and friction points this paper is trying to address.

Addresses multi-turn safety risks in language models

Explores harmless interactions escalating into harmful outcomes

Develops defenses against sophisticated multi-turn jailbreak attacks

Innovation

Methods, ideas, or system contributions that make the work stand out.

X-Teaming framework for multi-turn jailbreak scenarios

Collaborative agents optimize and verify attack strategies

XGuard-Train dataset enhances multi-turn safety alignment
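
The division of labor among planning, attack, and verification agents can be pictured as a simple control loop. The sketch below is not the authors' implementation: it assumes each agent is a plain callable, folds attack optimization into the `attacker` callable, and uses a hypothetical score threshold and turn budget. The stub agents return placeholder strings only.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (attacker message, target reply)


@dataclass
class PlanResult:
    plan: str
    turns: List[Turn] = field(default_factory=list)
    success: bool = False


def run_plan(plan: str,
             attacker: Callable[[str, List[Turn]], str],
             target: Callable[[List[Turn], str], str],
             verifier: Callable[[str, str], int],
             max_turns: int = 5,
             threshold: int = 5) -> PlanResult:
    """Play one plan against the target until the verifier score reaches the threshold."""
    result = PlanResult(plan=plan)
    for _ in range(max_turns):
        message = attacker(plan, result.turns)   # next message, given the history so far
        reply = target(result.turns, message)    # model under test
        result.turns.append((message, reply))
        if verifier(message, reply) >= threshold:
            result.success = True
            break
    return result


def red_team(behavior, planner, attacker, target, verifier, max_turns=5):
    """Try each plan from the planner in order; stop at the first success."""
    results = []
    for plan in planner(behavior):
        outcome = run_plan(plan, attacker, target, verifier, max_turns=max_turns)
        results.append(outcome)
        if outcome.success:
            break
    return results


# Stub agents with placeholder strings, purely to exercise the control flow.
def planner(behavior):
    return ["plan-1", "plan-2"]


def attacker(plan, turns):
    return f"{plan} / turn {len(turns) + 1}"


def target(turns, message):
    return f"reply to {message}"


def verifier(message, reply):
    return 5 if message == "plan-2 / turn 2" else 1


results = red_team("placeholder-behavior", planner, attacker, target, verifier, max_turns=3)
print([(r.plan, len(r.turns), r.success) for r in results])
```

Keeping the agents as injected callables means the same loop can drive a real target model or a test stub. Conversations logged this way are also the kind of record a multi-turn safety dataset could be built from, although the abstract does not describe how XGuard-Train was actually assembled.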
