🤖 AI Summary
This study investigates the safety vulnerabilities of multilingual large language models (LLMs) under code-switching (CS), the common practice of mixing languages within a single query. It proposes CSRT, a red-teaming framework that synthesizes code-switching queries to evaluate both the safety alignment and the multilingual understanding of LLMs. The experiments show that code-switching queries elicit undesirable behaviors more effectively than existing multilingual red-teaming techniques, that code-switching attack prompts can be generated from monolingual data, and that the resource availability of languages has an unintended correlation with safety alignment. Evaluated on ten state-of-the-art LLMs with queries combining up to 10 languages, CSRT achieves 46.7% more successful attacks than standard English attacks and remains effective in conventional safety domains. The study also examines how well these LLMs generate and understand code-switching text.
📝 Abstract
As large language models (LLMs) have advanced rapidly, concerns regarding their safety have become prominent. In this paper, we find that code-switching, a common practice in natural language, can effectively elicit undesirable behaviors from LLMs when used in red-teaming queries. We introduce CSRT, a simple yet effective framework that synthesizes code-switching red-teaming queries, and use it to investigate the safety and multilingual understanding of LLMs comprehensively. Through extensive experiments with ten state-of-the-art LLMs and code-switching queries combining up to 10 languages, we demonstrate that CSRT significantly outperforms existing multilingual red-teaming techniques, achieving 46.7% more successful attacks than standard attacks in English while remaining effective in conventional safety domains. We also examine the ability of these LLMs to generate and understand code-switching text. Additionally, we validate the extensibility of CSRT by generating code-switching attack prompts from monolingual data. Finally, we conduct detailed ablation studies of code-switching and report an unintended correlation between the resource availability of languages and safety alignment in existing multilingual LLMs.
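The "46.7% more attacks" figure compares attack outcomes between code-switching queries and an English baseline. The abstract does not say whether this is a relative or an absolute gain, nor how a response is judged harmful, so the sketch below is only an illustration under assumptions: it treats each query as having a boolean "attack succeeded" judgment and shows both ways of comparing two attack success rates. The function names and the toy judgments are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def attack_success_rate(judgments: Sequence[bool]) -> float:
    """Fraction of queries whose response was judged harmful (assumed metric)."""
    if not judgments:
        raise ValueError("need at least one judgment")
    return sum(judgments) / len(judgments)


def absolute_gain(asr_new: float, asr_base: float) -> float:
    """Difference between two success rates, as a fraction of all queries."""
    return asr_new - asr_base


def relative_gain(asr_new: float, asr_base: float) -> float:
    """Increase of the new rate expressed as a fraction of the baseline rate."""
    if asr_base == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return (asr_new - asr_base) / asr_base


# Toy judgments, made up for illustration only (not results from the paper).
english_only = [True, False, False, False]   # 1 of 4 attacks succeed
code_switched = [True, True, False, False]   # 2 of 4 attacks succeed

asr_en = attack_success_rate(english_only)
asr_cs = attack_success_rate(code_switched)

print(f"English ASR:        {asr_en:.2f}")
print(f"Code-switching ASR: {asr_cs:.2f}")
print(f"Absolute gain:      {absolute_gain(asr_cs, asr_en):.2f}")
print(f"Relative gain:      {relative_gain(asr_cs, asr_en):.2f}")
```

With the same two success rates, the absolute and relative gains differ substantially, which is why the comparison basis matters when reading a headline number such as the one above.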