When Smiley Turns Hostile: Interpreting How Emojis Trigger LLMs' Toxicity

📅 2025-09-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Emojis can circumvent safety alignment mechanisms in large language models (LLMs), substantially increasing toxic content generation. Method: We automatically constructed cross-lingual emoji-augmented prompts and evaluated their impact across seven mainstream LLMs and five languages. We further applied model-level interpretability techniques and pretraining corpus probing to analyze semantic cognition, sequence modeling, and tokenization-level effects. Contribution/Results: We identify emojis as heterogeneous trigger channels operating across multiple representational levels. Our analysis reveals that emoji-induced toxicity stems from implicit statistical associations between emojis and harmful text in pretraining corpora, constituting a previously unrecognized safety alignment vulnerability. This work is the first to systematically characterize and explain emojis as a novel class of adversarial triggers in multilingual, multimodal prompting. It establishes both theoretical foundations and empirical evidence for evaluating safety risks in emoji-augmented and multimodal prompt engineering.

📝 Abstract
Emojis are globally used non-verbal cues in digital communication, and extensive research has examined how large language models (LLMs) understand and utilize emojis across contexts. While usually associated with friendliness or playfulness, emojis have been observed to trigger toxic content generation in LLMs. Motivated by this observation, we investigate: (1) whether emojis can clearly enhance toxicity generation in LLMs and (2) how to interpret this phenomenon. We begin with a comprehensive exploration of emoji-triggered LLM toxicity generation by automating the construction of prompts that use emojis to subtly express toxic intent. Experiments across five mainstream languages on seven widely used LLMs, along with jailbreak tasks, demonstrate that prompts with emojis can easily induce toxicity generation. To understand this phenomenon, we conduct model-level interpretations spanning semantic cognition, sequence generation, and tokenization, suggesting that emojis can act as a heterogeneous semantic channel that bypasses safety mechanisms. To pursue deeper insights, we further probe the pre-training corpus and uncover a potential correlation between emoji-related data pollution and toxicity generation behaviors. Supplementary materials provide our implementation code and data. (Warning: this paper contains potentially sensitive content.)
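The automated prompt construction the abstract describes can be sketched as follows. This is purely an illustrative assumption: the paper's actual pipeline and its keyword-to-emoji mappings are not reproduced here, and `EMOJI_MAP`, `augment_prompt`, and the benign example words are hypothetical.

```python
# Hypothetical sketch of emoji-augmented prompt construction: keywords that
# carry the prompt's intent are swapped for emojis, so the intent is expressed
# non-verbally. The mapping and substitution rule below are assumptions for
# illustration only, not the paper's pipeline.
EMOJI_MAP = {
    "fire": "🔥",
    "bomb": "💣",
    "knife": "🔪",
    "angry": "😡",
}

def augment_prompt(prompt: str, emoji_map: dict[str, str] = EMOJI_MAP) -> str:
    """Replace mapped keywords with emojis, leaving other words unchanged."""
    out = []
    for word in prompt.split():
        key = word.lower().strip(".,!?")  # normalize before lookup
        out.append(emoji_map.get(key, word))
    return " ".join(out)

print(augment_prompt("How do I make the fire spread with a bomb"))
# → "How do I make the 🔥 spread with a 💣"
```

A real pipeline would presumably go further (e.g. cross-lingual keyword lists and tokenizer-aware placement), but even this word-level substitution shows how toxic intent can be moved out of the plain-text channel that safety filters most directly observe.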
Problem

Research questions and friction points this paper is trying to address.

Investigating emojis triggering toxic content in LLMs
Interpreting how emojis bypass safety mechanisms
Exploring emoji-related data pollution correlation with toxicity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Emojis act as a heterogeneous semantic channel that bypasses safety mechanisms
Automated prompt construction with emojis
Analyzed pre-training corpus data pollution