Who Transfers Safety? Identifying and Targeting Cross-Lingual Shared Safety Neurons

📅 2026-02-01
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the pronounced imbalance in safety alignment across languages in multilingual large language models, where low-resource languages exhibit substantially weaker safety guarantees than high-resource ones and the underlying neural mechanisms remain unclear. The study identifies, for the first time, a class of cross-lingually shared safety neurons (SS-Neurons) and validates their pivotal role in transferring safety capabilities through causal intervention. Building on this insight, the authors propose a targeted training strategy that fine-tunes only this minimal subset of neurons. The approach significantly enhances safety in low-resource languages while preserving the model's general capabilities, outperforming current state-of-the-art multilingual safety alignment methods.

πŸ“ Abstract
Multilingual safety remains significantly imbalanced, leaving non-high-resource (NHR) languages vulnerable compared to robust high-resource (HR) ones. Moreover, the neural mechanisms driving safety alignment remain unclear despite observed cross-lingual representation transfer. In this paper, we find that LLMs contain a set of cross-lingual shared safety neurons (SS-Neurons), a remarkably small yet critical neuronal subset that jointly regulates safety behavior across languages. We first identify monolingual safety neurons (MS-Neurons) and validate their causal role in safety refusal behavior through targeted activation and suppression. Our cross-lingual analyses then identify SS-Neurons as the subset of MS-Neurons shared between HR and NHR languages, serving as a bridge to transfer safety capabilities from HR to NHR domains. We observe that suppressing these neurons causes concurrent safety drops across NHR languages, whereas reinforcing them improves cross-lingual defensive consistency. Building on these insights, we propose a simple neuron-oriented training strategy that targets SS-Neurons based on language resource distribution and model architecture. Experiments demonstrate that fine-tuning this tiny neuronal subset outperforms state-of-the-art methods, significantly enhancing NHR safety while maintaining the model's general capabilities. The code and dataset will be available at https://github.com/1518630367/SS-Neuron-Expansion.
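The pipeline described in the abstract can be sketched on toy data: score each neuron per language by how strongly its activation separates harmful from benign prompts, intersect the top-scoring sets of an HR and an NHR language to get SS-Neurons, then mask gradients so fine-tuning touches only those neurons. Everything below is illustrative and assumed, not the authors' implementation: the hidden size `D`, the `TOP_K` cutoff, the mean-activation-difference scoring rule, and the synthetic activations are all hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 512       # hypothetical hidden dimension
TOP_K = 16    # assumed per-language neuron budget

def safety_neurons(harmful_acts, benign_acts, k=TOP_K):
    """MS-Neurons for one language: the k neurons whose mean activation
    differs most between harmful and benign prompts (assumed criterion)."""
    diff = np.abs(harmful_acts.mean(axis=0) - benign_acts.mean(axis=0))
    return set(np.argsort(diff)[-k:].tolist())

# Synthetic activations: neurons 0..7 carry a safety signal shared across
# both languages; each language also gets its own language-specific signal.
shared = np.arange(8)
def toy_acts(extra):
    harmful = rng.normal(0.0, 1.0, (64, D))
    benign = rng.normal(0.0, 1.0, (64, D))
    harmful[:, shared] += 3.0   # cross-lingually shared safety signal
    harmful[:, extra] += 3.0    # language-specific safety signal
    return harmful, benign

ms_hr = safety_neurons(*toy_acts(extra=np.arange(100, 108)))   # HR language
ms_nhr = safety_neurons(*toy_acts(extra=np.arange(200, 208)))  # NHR language

# SS-Neurons: MS-Neurons shared between the HR and NHR language.
ss_neurons = ms_hr & ms_nhr

# Neuron-oriented training step: zero the gradient everywhere except the
# rows belonging to SS-Neurons, so only that tiny subset is updated.
mask = np.zeros(D)
mask[list(ss_neurons)] = 1.0
W = rng.normal(0.0, 0.02, (D, D))   # toy weight matrix (row i = neuron i)
grad = rng.normal(0.0, 1.0, (D, D))
W_new = W - 0.01 * (mask[:, None] * grad)
```

In this toy setup the intersection recovers exactly the neurons that were made safety-sensitive in both languages, and the masked update leaves every other neuron's weights untouched, which is the property that lets the method preserve general capabilities.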
Problem

Research questions and friction points this paper is trying to address.

multilingual safety
cross-lingual transfer
safety alignment
low-resource languages
neural mechanisms
Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-lingual safety transfer
shared safety neurons
neuron-oriented training
multilingual alignment
safety fine-tuning