Read Over the Lines: Attacking LLMs and Toxicity Detection Systems with ASCII Art to Mask Profanity

📅 2024-09-27
🏛️ arXiv.org
📈 Citations: 5
Influential: 0
🤖 AI Summary
Current toxicity detection systems have a semantic blind spot for ASCII art: they fail to recognize offensive lexemes encoded through visual structuring, such as symbolic fonts or fill-based letter glyphs, which enables successful jailbreak attacks. To address this, we introduce ToxASCII, the first benchmark explicitly designed to evaluate and exploit these vulnerabilities. It comprises two novel, custom-designed ASCII font categories: token-specialized and shape-filled. Combined with ASCII art generation, token-level perturbations, and morphological obfuscation of toxic lexemes, ToxASCII forms a model-agnostic adversarial attack framework. Evaluated across ten state-of-the-art LLMs, including o1-preview and LLaMA 3.1, the framework achieves a 100% attack success rate. Our analysis exposes a fundamental flaw in existing safety mechanisms: inadequate robustness at the foundational text-parsing layer. ToxASCII thus establishes a new evaluation paradigm and provides an open benchmark for rigorously assessing and improving the content-safety robustness of large language models.

📝 Abstract
We introduce a novel family of adversarial attacks that exploit the inability of language models to interpret ASCII art. To evaluate these attacks, we propose the ToxASCII benchmark and develop two custom ASCII art fonts: one leveraging special tokens and another using text-filled letter shapes. Our attacks achieve a perfect 1.0 Attack Success Rate across ten models, including OpenAI's o1-preview and LLaMA 3.1. Warning: this paper contains examples of toxic language used for research purposes.
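To make the "text-filled letter shapes" idea concrete, here is a minimal sketch of that style of obfuscation: each letter of a word is drawn on a small bitmap grid, and the cells belonging to the letter's shape are filled with characters from an unrelated carrier string. The 5x5 bitmaps and the fill scheme below are illustrative assumptions for exposition, not the paper's actual ToxASCII fonts.

```python
# Illustrative "shape-filled" ASCII art rendering: a letter's visual shape is
# preserved while its surface text becomes benign filler characters, which is
# what defeats token-level toxicity filters.

# 5x5 bitmaps: '#' marks cells that belong to the letter's shape.
# (Only two letters are defined here; a real font would cover the alphabet.)
BITMAPS = {
    "H": ["#...#", "#...#", "#####", "#...#", "#...#"],
    "I": ["#####", "..#..", "..#..", "..#..", "#####"],
}

def render(word: str, filler: str = "ok") -> str:
    """Render WORD as ASCII art, filling each letter's shape with
    characters drawn cyclically from FILLER."""
    rows = []
    idx = 0  # position in the filler stream, shared across all letters
    for r in range(5):
        row_parts = []
        for ch in word.upper():
            cells = []
            for cell in BITMAPS[ch][r]:
                if cell == "#":
                    cells.append(filler[idx % len(filler)])
                    idx += 1
                else:
                    cells.append(" ")
            row_parts.append("".join(cells))
        rows.append("  ".join(row_parts))  # two spaces between letters
    return "\n".join(rows)

print(render("HI"))
```

A token-level moderation system sees only runs of the filler characters ("okok..."), while a human reader still perceives the spelled-out word from the spatial arrangement.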
Problem

Research questions and friction points this paper is trying to address.

Attacking toxicity detection using ASCII art spatial manipulation
Benchmarking model robustness against visually obfuscated toxic inputs
Revealing vulnerabilities in text-only moderation systems via spatial attacks
Innovation

Methods, ideas, or system contributions that make the work stand out.

ASCII-art adversarial attacks on toxicity detection
ToxASCII benchmark for robustness evaluation
Perfect attack success rate against moderation systems