🤖 AI Summary
This work identifies a critical safety vulnerability in large language models (LLMs): their safety performance degrades significantly on code-mixed inputs, with harmful-output rates increasing by 42% on average relative to monolingual English prompts. To address this, the study distinguishes, for the first time, between universally unsafe and culture-specific unsafe queries, and conducts cross-cultural comparative experiments using explainable-AI techniques (gradient-based attribution and attention visualization) to establish a multilingual safety evaluation framework. The core contribution is locating code-mixing-induced cultural sensitivity bias at the attention-head level: empirical analysis shows that shifts in internal attribution mechanisms, specifically misaligned cultural reasoning in attention patterns, are the primary cause of safety failures. The result is an interpretable, generalizable theoretical and empirical foundation for understanding LLM safety vulnerabilities in multilingual settings.
📝 Abstract
Recent advancements in LLMs have raised significant safety concerns, particularly when dealing with code-mixed inputs and outputs. Our study systematically investigates the increased susceptibility of LLMs to produce unsafe outputs from code-mixed prompts compared to monolingual English prompts. Using explainability methods, we dissect the internal attribution shifts that cause the models' harmful behaviors. In addition, we explore cultural dimensions by distinguishing between universally unsafe and culturally specific unsafe queries. This paper presents novel experimental insights that clarify the mechanisms driving this phenomenon.
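To make the gradient-based attribution mentioned above concrete, here is a minimal sketch of the general gradient-times-input technique on a toy logistic "unsafe" classifier (not the paper's actual model or data; the weights, features, and function names below are hypothetical, chosen only to illustrate how per-token attribution scores are computed):

```python
import numpy as np

def grad_x_input_saliency(w, b, x):
    """Gradient-times-input attribution for a toy logistic classifier.

    For p = sigmoid(w.x + b), the gradient is dp/dx = p*(1-p)*w, so each
    feature's saliency p*(1-p)*w_i*x_i measures its contribution to the
    predicted "unsafe" probability. Real attribution work applies the same
    idea to token embeddings inside a transformer.
    """
    z = float(w @ x + b)
    p = 1.0 / (1.0 + np.exp(-z))       # predicted probability of "unsafe"
    return p * (1.0 - p) * w * x       # per-feature attribution scores

# Hypothetical 3-feature input standing in for three tokens; the weights
# mark which "tokens" the classifier is sensitive to.
w = np.array([2.0, -1.0, 0.5])
x = np.array([1.0, 1.0, 0.0])
sal = grad_x_input_saliency(w, b=0.0, x=x)
# sal[0] is largest in magnitude: the first "token" drives the decision;
# sal[2] is exactly zero because that feature is absent from the input.
```

Comparing such attribution vectors between an English prompt and its code-mixed counterpart is one way to surface the kind of internal attribution shift the study reports.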