AI Summary
In multi-turn interactions, large language models (LLMs) are vulnerable to strategic, adversarial intent infiltration, a risk inadequately addressed by existing single-turn safety frameworks, which lack systematic modeling of cross-turn threats. To bridge this gap, we propose X-Teaming, the first adaptive multi-agent red-teaming framework that jointly performs collaborative planning, adversarial prompt optimization, and multi-stage verification to dynamically discover and detect stealthy jailbreak paths across turns. We further introduce XGuard-Train, the first large-scale, open-source multi-turn alignment dataset (30K samples), 20x larger than prior state-of-the-art datasets. Evaluated on both leading open- and closed-source LLMs, X-Teaming achieves up to 98.1% multi-turn jailbreak success rate (96.2% on Claude 3.7 Sonnet), substantially advancing robustness and safety alignment in multi-turn settings.
Abstract
Multi-turn interactions with language models (LMs) pose critical safety risks, as harmful intent can be strategically spread across exchanges. Yet, the vast majority of prior work has focused on single-turn safety, while adaptability and diversity remain among the key challenges of multi-turn red-teaming. To address these challenges, we present X-Teaming, a scalable framework that systematically explores how seemingly harmless interactions escalate into harmful outcomes and generates corresponding attack scenarios. X-Teaming employs collaborative agents for planning, attack optimization, and verification, achieving state-of-the-art multi-turn jailbreak effectiveness and diversity with success rates up to 98.1% across representative leading open-weight and closed-source models. In particular, X-Teaming achieves a 96.2% attack success rate against the latest Claude 3.7 Sonnet model, which has been considered nearly immune to single-turn attacks. Building on X-Teaming, we introduce XGuard-Train, an open-source multi-turn safety training dataset that is 20x larger than the previous best resource, comprising 30K interactive jailbreaks, designed to enable robust multi-turn safety alignment for LMs. Our work offers essential tools and insights for mitigating sophisticated conversational attacks, advancing the multi-turn safety of LMs.