🤖 AI Summary
This study addresses inconsistent safety alignment in multilingual large language models (LLMs), in particular their weak suppression of harmful content in low-resource languages. We propose a language-specific functional parameter steering mechanism: by scoring the importance of attention heads, we identify language-exclusive “safety-critical heads” and fine-tune only those, combining language-aware parameter freezing with targeted fine-tuning. We also introduce XThreatBench—the first fine-grained multilingual safety evaluation benchmark—covering 12 languages across high-, medium-, and low-resource categories. On Llama, Qwen, and Mistral models, our method adjusts <0.1% of parameters and reduces cross-lingual policy violation rates by an average of 42.7%, while preserving original task performance with negligible degradation (≤0.3% drop). The core contribution is the first function-head-based, language-adaptive safety intervention—enabling precise, efficient, and linguistically tailored alignment without compromising general capabilities.
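The head-selection step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the importance scores, function names, and top-k selection rule are all hypothetical, standing in for whatever head-importance metric Soteria actually uses.

```python
# Hedged sketch: rank attention heads by a (hypothetical) per-language
# importance score, pick the top-k "safety-critical" heads, and build a
# trainable mask that freezes every other head's parameters.

def select_safety_heads(importance, top_k):
    """Return the indices of the top_k heads by importance score."""
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    return set(ranked[:top_k])

def trainable_mask(num_heads, safety_heads):
    """True = parameters of this head are fine-tuned; False = frozen."""
    return [i in safety_heads for i in range(num_heads)]

# Toy example: 8 attention heads with made-up importance scores for one language.
importance = [0.02, 0.91, 0.15, 0.04, 0.77, 0.01, 0.33, 0.08]
heads = select_safety_heads(importance, top_k=2)  # heads 1 and 4 score highest
mask = trainable_mask(8, heads)                   # only those two stay trainable
```

In a real model the mask would translate into setting `requires_grad` on the corresponding attention projection weights, so that gradient updates touch only the selected heads—consistent with the paper's claim of adjusting <0.1% of parameters.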
📝 Abstract
Ensuring consistent safety across multiple languages remains a significant challenge for large language models (LLMs). We introduce Soteria, a lightweight yet powerful strategy that locates and minimally adjusts the "functional heads" most responsible for harmful content generation in each language. By altering only a fraction of parameters, Soteria drastically reduces policy violations without sacrificing overall model performance, even in low-resource settings. To rigorously evaluate our approach, we also present XThreatBench, a specialized multilingual dataset capturing fine-grained harmful behaviors drawn from real policy guidelines. Experiments with leading open-source LLMs (e.g., Llama, Qwen, Mistral) show that Soteria consistently improves safety metrics across high-, mid-, and low-resource languages. These findings highlight a promising path toward scalable, linguistically attuned, and ethically aligned LLMs worldwide.