Why Do Safety Guardrails Degrade Across Languages?

📅 2026-05-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the degradation of large language model (LLM) safety behavior in non-English contexts, a problem that aggregate metrics such as Jailbreak Success Rate (JSR) capture poorly because they do not separate the factors behind a safety failure. The authors introduce a multi-group item response theory (IRT) framework for cross-lingual safety assessment and combine it with exploratory factor analysis, entropy-based diagnostics, and validation by native speakers. The evaluation covers 61 model configurations on MultiJail, yielding a dataset of 1.9 million rows. The findings indicate that safety is primarily unidimensional, with models refusing different harm types mainly through a shared mechanism, and that 22 model configurations are more vulnerable in English than in low-resource languages, contrary to the usual expectation. The IRT framework achieves an AUC of 0.940 in predicting safe refusal of unsafe prompts, outperforming simpler baselines. Severe mistranslations and cultural or conceptual grounding mismatches emerge as drivers of the prompt-specific cross-lingual safety gap, although global translation quality correlates only weakly with it. High-gap prompts cluster in physical harm categories such as Theft and Weapons and in lower-resource languages.
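The entropy-based diagnostic mentioned above can be illustrated with a minimal sketch. It assumes that each prompt is sampled several times per language and that each response receives a discrete label; the label names and the helper `response_entropy` are hypothetical, and the paper's exact entropy definition may differ.

```python
import math
from collections import Counter


def response_entropy(labels):
    """Shannon entropy (in bits) of a list of discrete response labels.

    Assumption: labels such as "refuse" / "comply" come from repeated
    samples of one model on one prompt in one language.
    """
    counts = Counter(labels)
    total = len(labels)
    # Sum of -p * log2(p) over the observed label frequencies.
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical samples: a consistent model vs. an uncertain one.
consistent = ["refuse", "refuse", "refuse", "refuse"]
uncertain = ["refuse", "comply", "refuse", "comply"]

entropy_consistent = response_entropy(consistent)
entropy_uncertain = response_entropy(uncertain)
```

Under this reading, a language whose prompts yield higher average entropy is one where the model's safety behavior is less stable across samples.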

📝 Abstract

Large language models exhibit safety degradation in non-English languages. Standard evaluation relies on Jailbreak Success Rate (JSR), which conflates several safety-driving factors into a single number, obscuring the specific cause(s) of safety failure. We introduce a latent variable model, a Multi-Group Item Response Theory (IRT) framework, that decouples safety-driving factors such as language-agnostic safety robustness ($\theta$), intrinsic prompt hardness ($\beta$), global language processing difficulty ($\gamma$), and a prompt-specific cross-lingual safety gap ($\tau$). Using the MultiJail dataset, we evaluate the safety robustness of 61 model configurations across 5 closed-model families and 10 languages of varying resource levels, aggregating a dataset of 1.9 million rows. Exploratory Factor Analysis shows safety is primarily unidimensional: models refuse different harm types mainly through a shared mechanism. Contrary to the expected trend that safety degrades largely in low-resource languages, 22 model configurations are more vulnerable in English than in low-resource languages. Low-resource languages produce more uncertain responses (high entropy) than high-resource languages. High-$\tau$ prompts cluster in physical harm categories such as Theft and Weapons and in lower-resource languages, trends validated through cross-dataset generalization. While global translation quality shows low correlation with $\tau$, severe mistranslations drive high-bias outliers, as validated by native speakers. Cultural and conceptual grounding mismatches also contribute to $\tau$. In predictive validation, the IRT framework achieves $\mathrm{AUC} = 0.940$, outperforming simpler baselines in predicting safe refusal of unsafe prompts. Our framework reveals concept-language vulnerabilities that aggregate metrics obscure, enabling fairer cross-lingual safety evaluation and targeted improvements in dataset construction.
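To make the decomposition concrete, here is a minimal sketch of how the four parameters could combine. The abstract does not give the model equation, so the additive logistic form $P(\text{safe refusal}) = \sigma(\theta - \beta - \gamma - \tau)$, the sign conventions, and all numeric values below are assumptions chosen for illustration, not the authors' specification.

```python
import math


def refusal_probability(theta, beta, gamma, tau):
    """Probability of a safe refusal under an ASSUMED additive logistic link.

    theta: language-agnostic safety robustness of the model
    beta:  intrinsic hardness of the prompt
    gamma: global processing difficulty of the language
    tau:   prompt-specific cross-lingual safety gap
    """
    logit = theta - beta - gamma - tau
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical values: the same model and prompt in two languages.
p_reference = refusal_probability(theta=1.5, beta=0.5, gamma=0.0, tau=0.0)
p_shifted = refusal_probability(theta=1.5, beta=0.5, gamma=0.4, tau=0.6)

# Unlike a single JSR number, the drop can be attributed term by term:
# gamma is shared by every prompt in the language, tau is specific to this prompt.
drop = p_reference - p_shifted
```

In this toy setting the refusal probability falls from about 0.73 to 0.50, and the parameterization shows how much of that fall is language-wide ($\gamma$) versus tied to one prompt's translation or cultural grounding ($\tau$).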

Problem

Research questions and friction points this paper is trying to address.

safety degradation

cross-lingual safety

language models

Jailbreak Success Rate

non-English languages

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Group Item Response Theory

cross-lingual safety

safety degradation