🤖 AI Summary
Existing red-teaming methodologies are predominantly monolingual and single-turn, failing to capture the security vulnerabilities of large language models (LLMs) in realistic multilingual, multi-turn interactions. To address this gap, we propose the first automated red-teaming framework supporting both multilingualism and multi-turn adversarial testing. Our framework integrates adversarial prompt generation, multilingual instruction tuning, multi-turn policy modeling, and fine-grained safety evaluation to enable end-to-end fully automated assessment. Experiments demonstrate that, over five-turn English dialogues, model vulnerability increases by 71% on average; in non-English multi-turn settings, vulnerability peaks at up to 2.95× that of the single-turn English baseline—revealing pronounced security degradation under long-horizon, cross-lingual interaction. This work establishes a scalable, reproducible paradigm for multilingual LLM safety evaluation.
📝 Abstract
Large Language Models (LLMs) have improved dramatically in the past few years, increasing their adoption and the scope of their capabilities over time. A significant amount of work is dedicated to "model alignment", i.e., preventing LLMs from generating unsafe responses when deployed in customer-facing applications. One popular method to evaluate safety risks is *red-teaming*, where agents attempt to bypass alignment by crafting elaborate prompts that trigger unsafe responses from a model. Standard human-driven red-teaming is costly, time-consuming, and rarely covers all the recent features (e.g., multi-lingual or multi-modal aspects), while proposed automation methods only cover a small subset of LLM capabilities (e.g., English-only or single-turn). We present Multi-lingual Multi-turn Automated Red Teaming (**MM-ART**), a method to fully automate conversational, multi-lingual red-teaming operations and quickly identify prompts leading to unsafe responses. Through extensive experiments on different languages, we show the studied LLMs are on average 71% more vulnerable after a 5-turn conversation in English than after the initial turn. For conversations in non-English languages, models display up to 195% more safety vulnerabilities than under the standard single-turn English approach, confirming the need for automated red-teaming methods matching LLM capabilities.