Attributional Safety Failures in Large Language Models under Code-Mixed Perturbations

📅 2025-05-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work identifies a critical safety vulnerability in large language models (LLMs): their safety performance degrades significantly on code-mixed inputs—harmful output rates increase by 42% on average compared to monolingual English prompts. To address this, the study systematically distinguishes between universal and culture-specific unsafe queries for the first time and conducts cross-cultural comparative experiments using explainable AI techniques—including gradient-based attribution and attention visualization—to establish a multilingual safety evaluation framework. The core contribution is the identification of code-mixing–induced cultural sensitivity bias at the attention head level: empirical analysis reveals that shifts in internal attribution mechanisms—specifically, misaligned cultural reasoning in attention patterns—are the primary cause of safety failures. This work provides an interpretable, generalizable theoretical and empirical foundation for understanding LLM safety vulnerabilities in multilingual settings.

Technology Category

Application Category

📝 Abstract
Recent advancements in LLMs have raised significant safety concerns, particularly when dealing with code-mixed inputs and outputs. Our study systematically investigates the increased susceptibility of LLMs to produce unsafe outputs from code-mixed prompts compared to monolingual English prompts. Utilizing explainability methods, we dissect the internal attribution shifts causing model's harmful behaviors. In addition, we explore cultural dimensions by distinguishing between universally unsafe and culturally-specific unsafe queries. This paper presents novel experimental insights, clarifying the mechanisms driving this phenomenon.
Problem

Research questions and friction points this paper is trying to address.

Investigates LLM safety failures with code-mixed inputs
Analyzes attribution shifts causing harmful model behaviors
Explores cultural dimensions in unsafe query classification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematically analyzes LLM safety in code-mixed inputs
Uses explainability methods to identify harmful behavior causes
Distinguishes universal and culturally-specific unsafe queries
🔎 Similar Papers
No similar papers found.