Multilingual Safety Alignment via Self-Distillation

📅 2026-05-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two problems: large language models remain vulnerable to jailbreak attacks in low-resource languages, and existing alignment methods depend on costly, high-quality multilingual response data. The authors propose Multilingual Self-Distillation (MSD), a framework that achieves cross-lingual safety alignment without requiring target-language response data. MSD supports both on-policy and off-policy distillation, so it can be integrated into a range of large language models. It also introduces a Dual-Perspective Safety Weighting (DPSW) strategy that adaptively increases the distillation penalty on safety-critical tokens. Experiments across multiple models and multilingual jailbreak and utility benchmarks show that MSD improves safety performance, generalizes to unseen languages and more challenging scenarios, and preserves general capabilities.

📝 Abstract

Large language models (LLMs) exhibit severe multilingual safety misalignment: they possess strong safeguards in high-resource languages but remain highly vulnerable to jailbreak attacks in low-resource languages. Current safety alignment methods generally rely on high-quality response data for each target language, which is expensive and difficult to generate. In this paper, we propose a cross-lingual safeguard transfer framework named Multilingual Self-Distillation (MSD). This framework transfers an LLM's inherent safety capabilities from high-resource (e.g., English) to low-resource (e.g., Javanese) languages, overcoming the need for response data in any language. Our framework is flexible and can be integrated with different self-distillation strategies. Specifically, we implement two concrete methods -- on-policy MSD and off-policy MSD -- both of which enable effective cross-lingual safety transfer using only multilingual queries. Furthermore, we propose Dual-Perspective Safety Weighting (DPSW), a divergence measure to optimize the distillation objective. By jointly considering the perspectives of both the teacher and the student, DPSW adaptively increases the penalty weights on safety-critical tokens while reducing the weights on non-critical tokens. Extensive experiments on representative LLMs across diverse multilingual jailbreak and utility benchmarks demonstrate that our method consistently achieves superior multilingual safety performance. Notably, it generalizes effectively to more challenging datasets and unseen languages while preserving the model's general capabilities.
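
The token-weighting idea in the abstract can be illustrated with a small, self-contained sketch. The abstract does not give the DPSW formula, so the weighting below is a hypothetical stand-in and not the authors' method: the "teacher perspective" is assumed to be the forward KL divergence at each token position, the "student perspective" the reverse KL, and their average is normalized so that weights have mean 1. The function names and the plain-list logit inputs are likewise assumptions made for illustration. A plausible reading is that the teacher distribution comes from the model answering a high-resource query and the student distribution from the same model answering its low-resource translation.

```python
import math


def softmax(logits):
    """Convert one position's logits into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q, eps=1e-12):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def weighted_distillation_loss(teacher_logits, student_logits):
    """Token-weighted distillation loss (illustrative, NOT the paper's DPSW).

    Both arguments are lists of per-position logit lists.
    Returns (loss, weights), where weights has one entry per position.
    """
    teacher = [softmax(row) for row in teacher_logits]
    student = [softmax(row) for row in student_logits]

    # Assumed "teacher perspective": forward KL at each position.
    forward = [kl(t, s) for t, s in zip(teacher, student)]
    # Assumed "student perspective": reverse KL at each position.
    reverse = [kl(s, t) for t, s in zip(teacher, student)]

    # Combine both views, then normalize so the mean weight is 1.
    scores = [0.5 * (f + r) for f, r in zip(forward, reverse)]
    n = len(scores)
    total = sum(scores)
    if total < 1e-12:
        weights = [1.0] * n  # distributions already match everywhere
    else:
        weights = [n * s / total for s in scores]

    # In a training framework the weights would typically be treated as
    # constants (no gradient); here they are plain floats.
    loss = sum(w * f for w, f in zip(weights, forward)) / n
    return loss, weights


# Position 0: teacher and student agree. Position 1: they disagree.
teacher_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
student_logits = [[2.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
loss, weights = weighted_distillation_loss(teacher_logits, student_logits)
print(loss, weights)
```

Under this stand-in scheme, positions where the two distributions disagree most receive larger weights and positions where they already agree are down-weighted, which mirrors the behavior the abstract attributes to DPSW without claiming to reproduce it.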

Problem

Research questions and friction points this paper is trying to address.

multilingual safety alignment

large language models

jailbreak attacks

low-resource languages

safety misalignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual Self-Distillation

Cross-lingual Safety Alignment

Jailbreak Robustness