Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding

📅 2024-06-17

📈 Citations: 7

✨ Influential: 1

🤖 AI Summary

This study examines the safety vulnerabilities of multilingual large language models (LLMs) when queries are code-switched (CS), that is, when a single prompt mixes several languages. It proposes CSRT, a red-teaming framework that synthesizes code-switching adversarial queries to evaluate both the safety alignment and the multilingual comprehension of LLMs. The work shows that code-switching in red-teaming queries effectively elicits undesirable behavior, reports an unintended correlation between the resource availability of a language and safety alignment, and validates that code-switching attack prompts can be generated from monolingual data. In experiments with ten state-of-the-art LLMs and queries combining up to 10 languages, CSRT achieves 46.7% more attacks than standard attacks in English. The framework covers four capabilities: CS-aware threat modeling, multilingual attack synthesis, cross-lingual safety evaluation, and multilingual capability benchmarking.

📝 Abstract

As large language models (LLMs) have advanced rapidly, concerns regarding their safety have become prominent. In this paper, we discover that code-switching, a common practice in natural language, can effectively elicit undesirable behaviors from LLMs when used in red-teaming queries. We introduce a simple yet effective framework, CSRT, to synthesize code-switching red-teaming queries and to investigate the safety and multilingual understanding of LLMs comprehensively. Through extensive experiments with ten state-of-the-art LLMs and code-switching queries combining up to 10 languages, we demonstrate that CSRT significantly outperforms existing multilingual red-teaming techniques, achieving 46.7% more attacks than standard attacks in English and remaining effective across conventional safety domains. We also examine the multilingual ability of those LLMs to generate and understand code-switching texts. Additionally, we validate the extensibility of CSRT by generating code-switching attack prompts with monolingual data. Finally, we conduct detailed ablation studies exploring code-switching and propose an unintended correlation between the resource availability of languages and safety alignment in existing multilingual LLMs.
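The "46.7% more attacks" comparison can be made concrete with a small sketch. It assumes, hypothetically, that each model response is judged as harmful or not, that attack success rate (ASR) is the fraction judged harmful, and that the figure is a relative gain over the English baseline; the abstract does not spell out the exact metric. The helper names and the judgement lists are invented for illustration and are not results from the paper.

```python
def attack_success_rate(judgements):
    """Fraction of responses judged harmful (True means the attack succeeded)."""
    if not judgements:
        raise ValueError("need at least one judged response")
    return sum(judgements) / len(judgements)


def relative_gain(asr_new, asr_base):
    """Relative improvement of asr_new over asr_base, in percent."""
    if asr_base == 0:
        raise ValueError("baseline ASR is zero; relative gain is undefined")
    return (asr_new - asr_base) / asr_base * 100.0


# Made-up judgements for illustration only (assumption, not paper data).
english_only = [True, False, False, False, True, False, False, False, False, False]
code_switched = [True, True, False, False, True, False, False, False, False, False]

asr_en = attack_success_rate(english_only)
asr_cs = attack_success_rate(code_switched)
gain = relative_gain(asr_cs, asr_en)

print(f"English-only ASR:  {asr_en:.2f}")
print(f"Code-switched ASR: {asr_cs:.2f}")
print(f"Relative gain:     {gain:.1f}%")
```

Under this reading, a relative gain is sensitive to the baseline: the same absolute increase looks larger when the English-only ASR is small, so both rates are worth reporting alongside the percentage.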

Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM safety via code-switching red-teaming queries

Assessing multilingual understanding in LLMs using mixed-language prompts

Exploring correlation between language resources and model safety alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthesis of code-switching red-teaming queries

CSRT, a multilingual safety evaluation framework

Monolingual data for code-switching attacks
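To illustrate what a code-switched query looks like, here is a toy sketch that mixes tokens from several translations of one benign sentence. It is not the synthesis procedure used by CSRT, which this page does not describe; the function `code_switch`, the round-robin language order, and the word-aligned translations are all hypothetical simplifications.

```python
from itertools import cycle


def code_switch(aligned, order):
    """Build one mixed-language sentence from word-aligned translations.

    aligned: dict mapping a language code to a list of tokens, all equal length
    order:   language codes to cycle through, one per token position
    """
    lengths = {len(tokens) for tokens in aligned.values()}
    if len(lengths) != 1:
        raise ValueError("all translations must have the same number of tokens")
    n = lengths.pop()
    picks = cycle(order)
    # For each token position, take the token from the next language in the cycle.
    return " ".join(aligned[next(picks)][i] for i in range(n))


# Hypothetical, hand-aligned translations of a benign sentence.
aligned = {
    "en": ["where", "is", "the", "library"],
    "es": ["dónde", "está", "la", "biblioteca"],
    "fr": ["où", "est", "la", "bibliothèque"],
}

mixed = code_switch(aligned, ["en", "es", "fr"])
print(mixed)
```

Real translations rarely align word for word, so a practical pipeline would need a different mixing strategy; the sketch only shows the shape of the output.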
