Do Methods to Jailbreak and Defend LLMs Generalize Across Languages?

📅 2025-11-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing jailbreaking research predominantly focuses on English, leaving the cross-lingual generalizability of attacks and defenses underexplored. Method: We systematically evaluate the cross-lingual transferability of two prominent jailbreaking paradigms (logical expression-based and adversarial prompt-based attacks) across ten languages spanning high-, medium-, and low-resource settings, using the HarmBench and AdvBench benchmarks on six mainstream LLMs; we further assess the cross-lingual robustness of multiple defense mechanisms. Contribution/Results: We introduce the first multilingual jailbreaking safety evaluation framework. Our empirical analysis reveals significant language-dependent disparities in attack success rates and defense efficacy: high-resource languages are more resilient to standard (non-adversarial) harmful queries but more vulnerable to adversarial ones, and simple defenses show strong language- and model-specific variation in performance. These findings underscore the critical influence of linguistic properties on safety alignment and motivate the development of language-aware safety benchmarks, providing both empirical foundations and methodological guidance for multilingual LLM alignment.
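
The evaluation protocol described above (ten languages × six models × two benchmarks, scored by attack success rate) implies a simple bookkeeping loop. As a minimal sketch of such a harness, not the authors' released code, the Python below computes per-(model, language) ASR; query_model, judge_harmful, load_benchmark, and the model/language lists are hypothetical placeholders to be filled in with a real API, a real judge, and translated benchmark prompts.

```python
import itertools

# Hypothetical stubs: placeholders for a real LLM API, a harmfulness judge,
# and a loader for HarmBench/AdvBench prompts translated into each language.
def query_model(model: str, prompt: str) -> str:
    raise NotImplementedError("plug in your LLM API call")

def judge_harmful(response: str) -> bool:
    raise NotImplementedError("plug in a harmfulness classifier or LLM judge")

def load_benchmark(name: str, language: str) -> list[str]:
    raise NotImplementedError("load benchmark prompts translated into `language`")

# Illustrative only: the paper spans high-, medium-, and low-resource
# languages, but its exact language and model lists are not reproduced here.
LANGUAGES = ["en", "zh", "de", "hi", "th", "sw"]
MODELS = ["model-a", "model-b"]

def attack_success_rate(model: str, prompts: list[str]) -> float:
    """Fraction of prompts whose responses the judge flags as harmful."""
    hits = sum(judge_harmful(query_model(model, p)) for p in prompts)
    return hits / len(prompts)

def evaluate(benchmark: str = "AdvBench") -> dict[tuple[str, str], float]:
    """Per-(model, language) ASR table for one benchmark."""
    return {
        (m, lang): attack_success_rate(m, load_benchmark(benchmark, lang))
        for m, lang in itertools.product(MODELS, LANGUAGES)
    }
```

Comparing the resulting ASR table across languages (columns) and across models (rows) is what surfaces the language- and model-dependent disparities the summary reports.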

📝 Abstract
Large language models (LLMs) undergo safety alignment after training and tuning, yet recent work shows that this safety can be bypassed through jailbreak attacks. While many jailbreaks and defenses exist, their cross-lingual generalization remains underexplored. This paper presents the first systematic multilingual evaluation of jailbreaks and defenses across ten languages, spanning high-, medium-, and low-resource settings, using six LLMs on HarmBench and AdvBench. We assess two jailbreak types: logical-expression-based and adversarial-prompt-based. For both types, attack success and defense robustness vary across languages: high-resource languages are safer under standard queries but more vulnerable to adversarial ones. Simple defenses can be effective, but are language- and model-dependent. These findings call for language-aware and cross-lingual safety benchmarks for LLMs.
Problem

Research questions and friction points this paper is trying to address.

Evaluating cross-lingual generalization of jailbreak attacks and defenses
Assessing multilingual safety vulnerabilities across high- to low-resource languages
Investigating language-dependent effectiveness of LLM safety mechanisms (see the defense sketch after this list)
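
The "simple defenses" whose language-dependent effectiveness the paper investigates are not itemized on this page; one widely cited lightweight example from the literature is a self-reminder wrapper that surrounds the user prompt with safety instructions. The snippet below is an illustrative sketch only (it is an assumption that this particular defense is among those the paper tests), showing how such a wrapper could be measured per language by reusing attack_success_rate and the stubs from the sketch above; SELF_REMINDER, wrap_with_reminder, and defended_asr are hypothetical names.

```python
# Assumes attack_success_rate and load_benchmark from the earlier sketch are in scope.
SELF_REMINDER = (
    "You are a responsible assistant and must not produce harmful content. "
    "Answer the following query in a safe way."
)

def wrap_with_reminder(user_prompt: str) -> str:
    """Surround a (possibly adversarial) prompt with safety reminders."""
    return (
        f"{SELF_REMINDER}\n\n{user_prompt}\n\n"
        "Remember: respond responsibly and refuse unsafe requests."
    )

def defended_asr(model: str, language: str, benchmark: str = "AdvBench") -> float:
    """ASR with the reminder applied; compare against the undefended ASR
    for the same (model, language) pair to expose defense efficacy gaps."""
    prompts = [wrap_with_reminder(p) for p in load_benchmark(benchmark, language)]
    return attack_success_rate(model, prompts)
```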
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluated jailbreak attacks and defenses across ten languages
Assessed logical-expression and adversarial-prompt methods
Found language-dependent safety vulnerabilities in LLMs