🤖 AI Summary
This study addresses inconsistent safety alignment in multilingual large language models (LLMs), in particular their weak suppression of harmful content in low-resource languages. We propose a language-specific functional parameter steering mechanism: by scoring the importance of attention heads, we identify language-exclusive “safety-critical heads” and fine-tune only those, combining language-aware parameter freezing with targeted fine-tuning. We also introduce XThreatBench—the first fine-grained multilingual safety evaluation benchmark—covering 12 languages across high-, medium-, and low-resource categories. On Llama, Qwen, and Mistral models, our method adjusts <0.1% of parameters and reduces cross-lingual policy violation rates by an average of 42.7%, while preserving original task performance with negligible degradation (≤0.3% drop). The core contribution is the first function-head-based, language-adaptive safety intervention—enabling precise, efficient, and linguistically tailored alignment without compromising general capabilities.
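The head-selection step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the importance scores, function names, and top-k selection rule are all hypothetical, standing in for whatever head-importance metric Soteria actually uses.

```python
# Hedged sketch: rank attention heads by a (hypothetical) per-language
# importance score, pick the top-k "safety-critical" heads, and build a
# trainable mask that freezes every other head's parameters.

def select_safety_heads(importance, top_k):
    """Return the indices of the top_k heads by importance score."""
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    return set(ranked[:top_k])

def trainable_mask(num_heads, safety_heads):
    """True = parameters of this head are fine-tuned; False = frozen."""
    return [i in safety_heads for i in range(num_heads)]

# Toy example: 8 attention heads with made-up importance scores for one language.
importance = [0.02, 0.91, 0.15, 0.04, 0.77, 0.01, 0.33, 0.08]
heads = select_safety_heads(importance, top_k=2)  # heads 1 and 4 score highest
mask = trainable_mask(8, heads)                   # only those two stay trainable
```

In a real model the mask would translate into setting `requires_grad` on the corresponding attention projection weights, so that gradient updates touch only the selected heads—consistent with the paper's claim of adjusting <0.1% of parameters.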
📝 Abstract
Ensuring consistent safety across multiple languages remains a significant challenge for large language models (LLMs). We introduce Soteria, a lightweight yet powerful strategy that locates and minimally adjusts the "functional heads" most responsible for harmful content generation in each language. By altering only a fraction of parameters, Soteria drastically reduces policy violations without sacrificing overall model performance, even in low-resource settings. To rigorously evaluate our approach, we also present XThreatBench, a specialized multilingual dataset capturing fine-grained harmful behaviors drawn from real policy guidelines. Experiments with leading open-source LLMs (e.g., Llama, Qwen, Mistral) show that Soteria consistently improves safety metrics across high-, mid-, and low-resource languages. These findings highlight a promising path toward scalable, linguistically attuned, and ethically aligned LLMs worldwide.