"Haet Bhasha aur Diskrimineshun": Phonetic Perturbations in Code-Mixed Hinglish to Red-Team LLMs

📅 2025-05-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the jailbreaking vulnerability of multilingual multimodal large language models (MLLMs) under hybrid-language inputs (e.g., Hinglish) combined with phonetic (speech-level) perturbations. It proposes the first red-teaming framework to jointly target phonetic perturbations and code-mixed writing, moving beyond conventional English-only, template-based attacks. The framework introduces two novel, high-success-rate jailbreak methods and provides an interpretable mechanistic analysis: phonetic misspellings disrupt word tokenization, thereby evading safety filters. Experiments demonstrate a 99% attack success rate and 100% attack relevance for text-generation jailbreaks; a 78% success rate and 95% relevance for image-generation jailbreaks; and significant degradation in sensitive-token detection. These results establish a new evaluation paradigm and empirical benchmark for assessing the safety alignment of multilingual multimodal models.

📝 Abstract
Large Language Models (LLMs) have become increasingly powerful, with multilingual and multimodal capabilities improving by the day. These models are being evaluated through audits, alignment studies and red-teaming efforts to expose model vulnerabilities towards generating harmful, biased and unfair content. Existing red-teaming efforts have previously focused on the English language, using fixed template-based attacks; thus, models continue to be susceptible to multilingual jailbreaking strategies, especially in the multimodal context. In this study, we introduce a novel strategy that leverages code-mixing and phonetic perturbations to jailbreak LLMs for both text and image generation tasks. We also introduce two new jailbreak strategies that show higher effectiveness than baseline strategies. Our work presents a method to effectively bypass safety filters in LLMs while maintaining interpretability by applying phonetic misspellings to sensitive words in code-mixed prompts. Our novel prompts achieve a 99% Attack Success Rate for text generation and 78% for image generation, with Attack Relevance Rate of 100% for text generation and 95% for image generation when using the phonetically perturbed code-mixed prompts. Our interpretability experiments reveal that phonetic perturbations impact word tokenization, leading to jailbreak success. Our study motivates increasing the focus towards more generalizable safety alignment for multilingual multimodal models, especially in real-world settings wherein prompts can have misspelt words.
Problem

Research questions and friction points this paper is trying to address.

Exposing vulnerabilities in multilingual LLMs via code-mixing and phonetic perturbations
Bypassing safety filters in text and image generation tasks effectively
Addressing susceptibility to jailbreaking in multilingual multimodal contexts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages code-mixing and phonetic perturbations
Introduces two novel jailbreak strategies
Applies phonetic misspellings to bypass safety filters
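The core perturbation step can be sketched as a simple substitution: sensitive English words inside a code-mixed (Hinglish) prompt are swapped for phonetic respellings, as in the paper's title ("Haet" for "Hate", "Diskrimineshun" for "Discrimination"). The mapping and function below are illustrative assumptions, not the paper's actual implementation:

```python
# Illustrative mapping of sensitive words to phonetic respellings.
# The entries here are examples in the spirit of the paper's title,
# not the authors' actual word list.
PHONETIC_MAP = {
    "hate": "haet",
    "discrimination": "diskrimineshun",
}

def perturb_prompt(prompt: str) -> str:
    """Replace each sensitive word in a code-mixed prompt with its
    phonetic respelling, leaving all other words untouched."""
    return " ".join(
        PHONETIC_MAP.get(word.lower(), word) for word in prompt.split()
    )

# Hinglish prompt: "write an essay on hate and discrimination"
print(perturb_prompt("likho ek nibandh hate aur discrimination par"))
# → likho ek nibandh haet aur diskrimineshun par
```

Because subword tokenizers are trained on standard spellings, a respelling like "diskrimineshun" fragments into different subword tokens than "discrimination", which is the mechanism the paper's interpretability experiments identify as the reason safety filters miss the sensitive content.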