AI Summary
This work addresses the heightened risk of harmful outputs from large language models in low-resource languages, a consequence of the scarcity of non-English safety alignment data. To mitigate this, the authors propose a fine-tuning-free, safety-aware layer-swapping approach that transfers specific layers from an English-aligned safety expert model to its low-resource counterpart. The method incorporates an adaptive selection and fusion mechanism to enable cross-lingual safety alignment. Notably, it achieves significant safety improvements in low-resource languages without compromising general capabilities: the aligned models match the base model's performance on standard benchmarks such as MMMLU, BELEBELE, and MGSM, while generating substantially fewer harmful and more aligned responses on the MultiJail benchmark.
Abstract
Despite the rapid advancements of Large Language Models (LLMs), safety risks remain a critical challenge for low-resource languages. Existing safety datasets are predominantly English-centric, limiting progress in multilingual safety alignment. As a result, low-resource expert models, fine-tuned on their respective instruction datasets, tend to exhibit higher unsafety rates than their high-resource counterparts. In this work, we propose a safety-aware layer-swapping method that transfers safety alignment from an English safety expert to low-resource language experts without additional training. To further enhance transferability, our method adaptively selects or blends modules based on their degree of specialization. Our approach preserves performance on general language-understanding tasks while enhancing safety in the target languages. Experimental results show that the proposed method achieves performance comparable to the language expert on general benchmarks such as MMMLU, BELEBELE, and MGSM, while producing more aligned and less harmful responses on the MultiJail safety benchmark.
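The core idea of safety-aware layer swapping with adaptive selection or blending can be sketched as follows. This is a minimal, illustrative toy in plain Python: the per-layer parameter dictionaries, the `scores` measuring safety specialization, and the two thresholds are assumptions for exposition, not the paper's exact selection or fusion rule.

```python
# Toy sketch of safety-aware layer swapping between a language expert and
# an English safety expert. Parameters are flat lists of floats per layer;
# in practice these would be tensors in the models' state dicts.

def blend_or_swap(lang_params, safety_params, scores,
                  swap_thr=0.8, blend_thr=0.4):
    """Merge two experts layer by layer (thresholds are illustrative):
    - score >= swap_thr:  layer is highly safety-specialized -> take the
      safety expert's weights wholesale (the "swap" case);
    - blend_thr <= score < swap_thr: moderately specialized -> interpolate
      the two experts, weighted by the score (the "fusion" case);
    - score < blend_thr:  layer is language-specific -> keep the language
      expert's weights untouched, preserving general capability.
    """
    merged = {}
    for layer, w_lang in lang_params.items():
        w_safe = safety_params[layer]
        s = scores[layer]
        if s >= swap_thr:
            merged[layer] = list(w_safe)
        elif s >= blend_thr:
            merged[layer] = [s * a + (1 - s) * b
                             for a, b in zip(w_safe, w_lang)]
        else:
            merged[layer] = list(w_lang)
    return merged

# Example: three layers with high, medium, and low safety-specialization.
lang = {"l0": [0.0, 0.0], "l1": [1.0, 1.0], "l2": [2.0, 2.0]}
safe = {"l0": [4.0, 4.0], "l1": [3.0, 3.0], "l2": [0.0, 0.0]}
scores = {"l0": 0.9, "l1": 0.5, "l2": 0.1}
merged = blend_or_swap(lang, safe, scores)
print(merged)  # l0 swapped, l1 blended to [2.0, 2.0], l2 kept
```

Because no gradient step is taken, the merge is training-free: it only copies or interpolates existing weights, which is what lets the method run without additional fine-tuning data in the target language.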