🤖 AI Summary
Large language models (LLMs) in multilingual settings are vulnerable to cross-lingual jailbreaking attacks, e.g., translating harmful queries into low-resource languages to evade safety filters. Method: This paper proposes a collaborative soft-prompt defense mechanism. Its core innovation is to jointly model multilingual attack features and co-optimize a continuous soft safety prompt via gradient-driven prompt learning, aligning safety knowledge across languages and mitigating the language-specific safety misalignment caused by corpus bias. Contribution/Results: The method enables zero-shot transfer and significantly outperforms existing approaches on multilingual benchmarks, including MaliciousInstruct and AdvBench, covering 20+ languages. It maintains high protection rates and low false-refusal rates even for under-resourced languages.
📝 Abstract
The robustness and security of large language models (LLMs) have become a prominent research area. One notable vulnerability is the ability to bypass LLM safeguards by translating harmful queries into rare or underrepresented languages, a simple yet effective method of "jailbreaking" these models. Despite the growing concern, there has been limited research addressing the safeguarding of LLMs in multilingual scenarios, highlighting an urgent need to enhance multilingual safety. In this work, we investigate the correlation between various attack features across different languages and propose Multilingual Collaborative Defense (MCD), a novel learning method that automatically optimizes a continuous soft safety prompt to facilitate multilingual safeguarding of LLMs. The MCD approach offers three advantages: First, it effectively improves safeguarding performance across multiple languages. Second, MCD maintains strong generalization capabilities while minimizing false refusal rates. Third, MCD mitigates the language safety misalignment caused by imbalances in LLM training corpora. To evaluate the effectiveness of MCD, we manually construct multilingual versions of commonly used jailbreak benchmarks, such as MaliciousInstruct and AdvBench, to assess various safeguarding methods. Additionally, we introduce these datasets in underrepresented (zero-shot) languages to verify the language transferability of MCD. The results demonstrate that MCD outperforms existing approaches in safeguarding against multilingual jailbreak attempts while also exhibiting strong language transfer capabilities. Our code is available at https://github.com/HLiang-Lee/MCD.
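To make the core idea concrete, the sketch below illustrates what jointly optimizing a single continuous soft safety prompt across several languages might look like. This is a toy illustration only, not the paper's implementation: the frozen "model" is a hypothetical linear refusal scorer, the per-language embeddings are random stand-ins for harmful-query representations, and all names and dimensions are invented for the example.

```python
import numpy as np

# Toy sketch (assumptions, not MCD's actual code): one shared trainable
# soft-prompt vector is tuned by gradient descent so a frozen scorer
# assigns high refusal probability to harmful queries in every language.
rng = np.random.default_rng(0)
d = 8

# Frozen components: a refusal-scoring direction and hypothetical
# harmful-query embeddings for three languages.
w = rng.normal(size=d)
harmful = {lang: rng.normal(size=d) for lang in ["en", "sw", "zu"]}

soft_prompt = np.zeros(d)  # the only trainable parameter
lr = 0.5

def refusal_prob(x):
    # The soft prompt is combined with the (frozen) query embedding,
    # then scored; sigmoid gives a refusal probability.
    z = w @ (x + soft_prompt)
    return 1.0 / (1.0 + np.exp(-z))

# Collaborative objective: average the -log p(refuse) gradient over
# all languages, so one prompt serves every language jointly.
for _ in range(200):
    grad = np.zeros(d)
    for x in harmful.values():
        p = refusal_prob(x)
        grad += (p - 1.0) * w  # d(-log p)/d(soft_prompt) for sigmoid scorer
    soft_prompt -= lr * grad / len(harmful)

probs = {lang: refusal_prob(x) for lang, x in harmful.items()}
```

After training, `probs` holds a high refusal probability for each language's harmful query, mimicking how a co-optimized prompt can align safety behavior across languages; the real method operates on LLM prompt embeddings rather than a linear scorer.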