🤖 AI Summary
Existing red-teaming methodologies are predominantly monolingual and single-turn, failing to capture the security vulnerabilities of large language models (LLMs) in realistic multilingual, multi-turn interactions. To address this gap, we propose the first automated red-teaming framework supporting both multilingualism and multi-turn adversarial testing. Our framework integrates adversarial prompt generation, multilingual instruction tuning, multi-turn policy modeling, and fine-grained safety evaluation to enable end-to-end fully automated assessment. Experiments demonstrate that, over five-turn English dialogues, model vulnerability increases by 71% on average; in non-English multi-turn settings, vulnerability peaks at up to 2.95× that of the single-turn English baseline—revealing pronounced security degradation under long-horizon, cross-lingual interaction. This work establishes a scalable, reproducible paradigm for multilingual LLM safety evaluation.
📝 Abstract
Large Language Models (LLMs) have improved dramatically in the past few years, increasing their adoption and the scope of their capabilities over time. A significant amount of work is dedicated to "model alignment", i.e., preventing LLMs from generating unsafe responses when deployed in customer-facing applications. One popular method to evaluate safety risks is *red-teaming*, where agents attempt to bypass alignment by crafting elaborate prompts that trigger unsafe responses from a model. Standard human-driven red-teaming is costly, time-consuming, and rarely covers all the recent features (e.g., multi-lingual or multi-modal aspects), while proposed automation methods only cover a small subset of LLM capabilities (e.g., English-only or single-turn). We present Multi-lingual Multi-turn Automated Red Teaming (**MM-ART**), a method to fully automate conversational, multi-lingual red-teaming operations and quickly identify prompts leading to unsafe responses. Through extensive experiments on different languages, we show the studied LLMs are on average 71% more vulnerable after a 5-turn conversation in English than after the initial turn. For conversations in non-English languages, models display up to 195% more safety vulnerabilities than under the standard single-turn English approach, confirming the need for automated red-teaming methods matching LLM capabilities.